Article

Multiple Object Tracking in Robotic Applications: Trends and Challenges

1
Electrical, Computer and Biomedical Engineering Department, College of Engineering, Abu Dhabi University, Abu Dhabi 59911, United Arab Emirates
2
Network and Communication Engineering Department, College of Engineering, Al Ain University, Abu Dhabi 112612, United Arab Emirates
3
Mechanical and Industrial Engineering Department, College of Engineering, Abu Dhabi University, Abu Dhabi 59911, United Arab Emirates
*
Author to whom correspondence should be addressed.
These authors contributed equally to this work.
Appl. Sci. 2022, 12(19), 9408; https://doi.org/10.3390/app12199408
Submission received: 18 August 2022 / Revised: 13 September 2022 / Accepted: 14 September 2022 / Published: 20 September 2022
(This article belongs to the Special Issue Trends and Challenges in Robotic Applications)

Abstract

The recent advancement in autonomous robotics is directed toward designing a reliable system that can detect and track multiple objects in the surrounding environment for navigation and guidance purposes. This paper aims to survey the recent development in this area and present the latest trends that tackle the challenges of multiple object tracking, such as heavy occlusion, dynamic background, and illumination changes. Our research includes Multiple Object Tracking (MOT) methods incorporating the multiple inputs that can be perceived from sensors such as cameras and Light Detection and Ranging (LIDAR). In addition, a summary of the tracking techniques, such as data association and occlusion handling, is detailed to define the general framework that the literature employs. We also provide an overview of the metrics and the most common benchmark datasets, including Karlsruhe Institute of Technology and Toyota Technological Institute (KITTI), MOTChallenges, and University at Albany DEtection and TRACking (UA-DETRAC), that are used to train and evaluate the performance of MOT. At the end of this paper, we discuss the results gathered from the articles that introduced the methods. Based on our analysis, deep learning has introduced significant value to the MOT techniques in recent research, resulting in high accuracy while maintaining real-time processing.

1. Introduction

Integrating computer vision and deep learning-based systems in the robotics field has led to a massive leap in the advancement of autonomous features. The utilization of different sensors, such as cameras and LIDARs, and the progress established by recent research on processing their data have introduced multiple object tracking techniques into autonomous driving and robotics navigation systems. Multiple object tracking has been one of the most challenging topics researched through computer vision techniques, for two main reasons: (1) multiple object tracking (MOT) is an essential tool for enhancing security and automating robotics navigation, and (2) occlusion remains the main obstacle to reaching reliable accuracy and is difficult to tackle. In this paper, we aim to survey the different approaches to MOT introduced recently in autonomous robotics.
Much research has been done to enhance the performance of tracking in SLAM applications [1,2]. It is simply difficult to navigate through an environment when the positions of the robot itself and of other objects in its surroundings are unknown. Tracking is used to estimate the location of the robot relative to other components in the environment, and the most challenging part of this process is the existence of highly dynamic objects [3] such as people or vehicles. SLAM-based autonomous navigation in robots has been regarded as essential in development and research primarily because of its potential in many aspects. One example is the autonomous wheelchair systems reviewed in Refs. [4,5]. The authors in Ref. [6] provided a survey of the mobile devices that assist people with disabilities. Autonomous driving can reduce the number of accidents that occur due to fatigue and distraction [7]. Nevertheless, public opinion remains hesitant about whether autonomous vehicles can be considered trustworthy. Providing drivers with awareness and understanding of the capabilities of the sensors in autonomous vehicles is vital to the proper employment of the technology in our daily lives; these sensors should neither be disregarded nor relied upon entirely [8]. The approach introduced in Ref. [9] aims to reduce the risks firefighters encounter by deploying a team of UAVs with an MOT system to track wildfires and control the situation. The authors in Ref. [10] employ MOT to guide and control a swarm of drones. Similarly, MOT is utilized with UAVs for collision avoidance in Ref. [11].
As discussed in Refs. [12,13,14], the general framework for MOT is shown in Figure 1. The input frame is subjected to an object detection algorithm. Then, the detections from the current frame and the previous frames are matched into trajectories by motion, appearance, and/or other features. This process generates tracks representing the objects through the sequence of frames. Data association between multiple frames is applied to follow an object over time. A reliable MOT system should be able to handle new tracks as well as lost ones. This is where the occlusion issue arises: lost tracks reappear because the objects did not move out of the sensor’s view but were temporarily hidden by other objects.
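To make the framework in Figure 1 concrete, the sketch below outlines a minimal tracking-by-detection loop in Python. The affinity function, the track class, and the thresholds (MAX_AGE, MIN_AFFINITY) are illustrative placeholders, not components of any specific method surveyed here.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

MAX_AGE = 30           # frames a lost track is kept before it is deleted (assumed value)
MIN_AFFINITY = 0.3     # minimum affinity to accept a match (assumed value)

class Track:
    """A track stores the last matched detection and how long it has been lost."""
    _next_id = 0
    def __init__(self, detection):
        self.id = Track._next_id
        Track._next_id += 1
        self.detection = detection
        self.frames_since_update = 0

def track_frame(tracks, detections, affinity):
    """One iteration of the Figure 1 pipeline: associate, update, spawn, prune."""
    matched_tracks, matched_dets = set(), set()
    if tracks and len(detections) > 0:
        # cost = 1 - affinity so that the optimal assignment maximizes affinity
        cost = np.array([[1.0 - affinity(t.detection, d) for d in detections]
                         for t in tracks])
        for r, c in zip(*linear_sum_assignment(cost)):
            if 1.0 - cost[r, c] >= MIN_AFFINITY:
                tracks[r].detection = detections[c]
                tracks[r].frames_since_update = 0
                matched_tracks.add(r)
                matched_dets.add(c)

    # age unmatched tracks (simple occlusion handling) and drop stale ones
    survivors = []
    for i, t in enumerate(tracks):
        if i not in matched_tracks:
            t.frames_since_update += 1
        if t.frames_since_update <= MAX_AGE:
            survivors.append(t)

    # unmatched detections start new tracks
    survivors += [Track(d) for j, d in enumerate(detections) if j not in matched_dets]
    return survivors
```

The affinity callable stands in for any of the cues discussed in Section 2 (appearance, motion, IoU, or a weighted combination of them).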

1.1. Challenges

Effective object tracking requires a robust and efficient model. This section provides a comprehensive overview of the various challenges faced in developing and optimizing such models.
The first challenge a model must face is the quality of the input video [15]. If the model cannot process the video properly, additional work is required to convert it into a clear form that can be used to detect objects. The classification system must first identify the objects that fall under a particular class and assign them IDs based on their shape and form, which raises the issue that objects of the same class come in varying shapes and sizes [16]. After the objects are detected, they must be assigned IDs with bounding boxes so that the model can distinguish multiple similar objects in the coverage area. The next challenge is identifying the objects that are moving within the ROI of the camera; this can cause the classification system to misclassify objects or even identify them as new ones. Aside from the quality of the video input, other factors affect the classification of objects. For instance, the illumination conditions can significantly influence the model’s accuracy [12,13,14,16]. The model may not be able to detect objects that blend with the environment or the background, and it also needs to identify objects moving at varying speeds. One of the most challenging issues in object tracking is occlusion, where the object’s movement gets interrupted by other objects in the scene [12,13,14,16]. It can be caused by various factors, such as natural conditions or the object moving out of the camera’s ROI; even when the object remains in the ROI, other objects might block its view. Therefore, the system must be trained to identify and track the objects in motion and to re-identify reappearing objects with the same IDs already assigned. Figure 2 shows an example of the occlusion issue. The yellow arrow follows one of the tracks that maintains its ID after experiencing full occlusion. The problem of occlusion is minimized in bird’s-eye view tracking [17]; however, other challenges arise, such as the low resolution of objects and misclassifications. Another obstacle is related to onboard tracking in self-driving applications, where the tracking process needs to be both quick and accurate for an efficient driving-assistance system. The FPS is one of the essential factors determining the tracking quality in this case [18].

1.2. Related Work

The authors in Ref. [20] provided a summary of the techniques developed for SLAMMOT (the combination of SLAM and MOT systems) that utilize dynamic features to construct 3D object tracking and multi-motion segmentation systems. In Ref. [20], techniques for 3D tracking of dynamic objects were categorized into trajectory triangulation, particle filter, and factorization-based approaches. The authors in Ref. [20] also discussed the data fusion problem in autonomous vehicles: the perception of data from different sensors, such as RGB cameras, LIDAR, and depth cameras, provides more knowledge and understanding of the surrounding environment and increases the navigation system’s robustness. In Ref. [20], a survey is conducted on the techniques used for SLAM in autonomous driving applications and the limitations of the current research, with the evaluation performed on the KITTI dataset.
The authors in Ref. [12] categorized the MOT approaches into three groups. The first grouping is by initialization method, which defines whether the tracking is detection-based or detection-free. Detection-based tracking, or tracking-by-detection, is the most common: a detected object is connected to its trajectory across future frames, and this connection can be established by calculating similarity based on appearance or motion. Detection-free tracking, where a set of objects is manually localized in the first frame and then tracked through future frames, is not optimal when new objects appear and is rarely applied. The second grouping is by processing mode, either online or offline tracking. Online tracking, where objects are detected and tracked in real time, is more suitable for autonomous driving applications, whereas offline tracking processes a batch of frames at a low FPS. The final grouping is by the type of output, which can be stochastic, where the tracking varies across different runs, or deterministic, where the tracking is constant. The authors further define the components included in an MOT system. The appearance model is used to extract spatial information from the detections and then calculate their similarity; the visual features and representations extracted can be defined either locally or regionally.
Motion models are used to predict the future location of a detected object and, hence, reduce the inspection area. A good model yields a reliable estimate after a certain number of frames, as its parameters are tuned towards learning how the object moves. Linear (constant velocity) and non-linear are the two types of motion models. Although there has been rapid advancement in multiple object detection and tracking for autonomous driving, current systems still process only a few object types. It would be a giant leap forward to have a system capable of tracking all types of objects in real time. This can be achieved by generating a great deal of data that tackles the problem from different perceptions, such as camera, LIDAR, ultrasonic, etc. [21]. The issues related to the full deployment of MOT in autonomous vehicles lie in the fact that its reliability depends heavily on many parameters, such as the camera view and the type of background (dynamic or static), which makes it difficult to fully trust MOT across different real scenarios and environments [12]. Tracking pedestrians is a far more difficult task than tracking vehicles: vehicle motion is bounded by the road, whereas the motion of people is highly random and challenging for the system to learn. Another issue is occlusion, which leads to high fragmentation and many ID switches because tracks are lost and re-initialized every time they disappear. Very few systems tackle the problem comprehensively, which leaves a huge space for improvement [22].
In Ref. [16], the tracking algorithms are categorized into two groups. The first is matching-based tracking, in which features such as appearance and motion are first extracted and then used to measure similarity in future frames. The second is filtering-based tracking, where Kalman and particle filters are discussed. The authors in Ref. [13] comprehensively surveyed the deep learning-based methods for MOT; they also provided an overview of the data in the MOTChallenges and the types of conditions included, followed by an evaluation of the performance of some methods on this dataset. In Ref. [14], the deep learning-based methods for MOT were also reviewed. Similarly, the authors provided an overview of the benchmark datasets, including the MOTChallenges and KITTI, and presented the performance of some methods.
In Ref. [23], the vision-based methods used to detect and track vehicles at road intersections were discussed. The authors categorized those methods depending on the sensors used and the approach carried out for detection and tracking. On the other hand, the authors in Ref. [24] presented methods for vehicle detection and tracking in urban areas, followed by an evaluation. The role of UAVs in civil applications, including surveillance, has been surveyed in Refs. [25,26]. The authors discussed the characteristics and roles of UAVs in traffic flow monitoring; however, there have not been many contributions to vehicle tracking methods using a UAV.
The authors in Refs. [7,8] provided a detailed overview of the types of sensors mounted on autonomous vehicles for data perception, such as LIDAR, ultrasonic sensors, and cameras. They also surveyed the current commercial advancement in the autonomous driving field and the type of technology associated with it. In Ref. [21], the authors studied the role of deep learning in autonomous driving, including perception and path planning. In addition to deep learning approaches, a general review was introduced in Ref. [27]. In Ref. [28], the methods used to extract and match information from multiple sensors used for perception were reviewed; the authors also discussed how data association can be an issue when using multiple sensors to achieve reliable multiple object tracking. The authors in Ref. [29] surveyed the methods that utilize LIDAR for data perception and grouped the performance results on the KITTI dataset. Ref. [22] provided a comprehensive overview of the KITTI dataset’s role in autonomous driving applications; the dataset can be used for training and testing on pedestrians, vehicles, cyclists, and other objects that can be found on the road. Moreover, the dataset was extended to lane and road mark detection by Ref. [30].
Although the reviews mentioned above were very thorough, in this paper we aim to provide comprehensive research that surveys the techniques associated with autonomous robotics applications, provides insight into the different tracking methods, gathers and compares the results from the different methods discussed, and evaluates the current work to find limitations that require future research. Table 1 lists the recent reviews, their year of publication, and the datasets used for comparing MOT methods.
Section 2 discusses the state-of-the-art methods and techniques introduced in the literature. Section 3 discusses the benchmark datasets and evaluation metrics popularly used for training and testing. Section 4 presents the evaluation results collected from the literature and a discussion. Finally, Section 5 and Section 6 present the current research challenges and the future work that is required.

2. MOT Techniques

In this section, we go through the most recent MOT techniques and the common trends being followed for matching tracks across multiple frames.
Table 2 shows a summary of the components used in MOT techniques. It can be observed that the appearance cue is rarely neglected, and motion cues are also frequently present. Most approaches depend on deep learning for extracting visual features. CNNs are vital tools that can extract visual features from the tracks and achieve accurate matching of detections to tracks [14]. The approaches introduced in Refs. [31,32] use Long Short-Term Memory (LSTM) based networks for motion modeling. LSTM networks are considered in MOT for appearance and motion modeling because they can find patterns by efficiently processing the previous frames in addition to the current ones. On the other hand, the authors in Ref. [33] generated histograms from the detections and used them as the appearance features. As for data association, the Hungarian algorithm is common in MOT techniques, such as Refs. [33,34,35,36], for associating the current detections with the previous ones, although the performance of these techniques did not show much potential. Deep learning has rarely been utilized for data association; however, the best-performing technique on the MOT16 and MOT17 datasets relied on a prediction network to validate that two bounding boxes are related. For occlusion handling, most approaches rely on feeding the history of tracks into the tracking system to validate the lost ones. Tracks absent for a specific number of frames are considered lost and deleted from the history; this avoids processing a massive number of detections and reducing the FPS.
The general framework illustrated in Figure 1 is followed by most of the recent MOT techniques. The most common approach for extracting the visual features of an object is using a CNN. VGG-16 is very popular for this application, as in Refs. [32,42,48]. The issue with deploying a CNN is the slow computation time due to the high-dimensional output. Zhao et al. [36] tackled this issue by applying PCA followed by a correlation filter to the output of the CNN for dimensionality reduction. An encoder with fewer parameters is introduced in Ref. [49] for faster computation. Another popular network for extracting appearance features is ResNet-50, as in Refs. [35,41,43], which resulted in competitive accuracy with fast computation. Peng et al. [43] extracted the appearance features from different layers of a ResNet-50 network, forming a Feature Pyramid Network, which has the advantage of detecting objects at different scales. The LSTM network is an important architectural concept for processing a sequence of frames and has been used for MOT in multiple approaches, such as Refs. [31,32]. The main approach taken by most current methods is to store the appearance features of the previous frames and retrieve them for comparison with those of the current frame. The important factor affecting the reliability of this comparison is how the stored features are updated: the object’s appearance varies across the frames but not significantly between two adjacent frames; hence, constant updating can lead to a higher matching accuracy.
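As an illustration of the appearance cue, the sketch below crops each detection, passes it through a pretrained ResNet-50 backbone (via torchvision, used here purely as an example and requiring a recent torchvision version; the surveyed methods use their own networks such as VGG-16 or dedicated re-identification models), and compares embeddings with cosine similarity. The exponential update at the end is a simple reading of the "constant updating" idea discussed above, not a specific published scheme.

```python
import torch
import torch.nn.functional as F
from torchvision import models, transforms

# Pretrained ResNet-50 with the classification head removed, so the output of the
# global-average-pooling stage serves as a 2048-D appearance embedding.
backbone = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
backbone.fc = torch.nn.Identity()
backbone.eval()

preprocess = transforms.Compose([
    transforms.ToPILImage(),
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

@torch.no_grad()
def appearance_embedding(frame, box):
    """Embed the image patch inside box = (x1, y1, x2, y2); frame is an HxWx3 uint8 array."""
    x1, y1, x2, y2 = [int(v) for v in box]
    patch = frame[y1:y2, x1:x2]
    return F.normalize(backbone(preprocess(patch).unsqueeze(0)), dim=1)

def appearance_affinity(track_embedding, detection_embedding):
    """Cosine similarity between a stored track embedding and a new detection embedding."""
    return float((track_embedding * detection_embedding).sum())

def update_track_embedding(old, new, momentum=0.9):
    """Smoothly refresh the stored appearance feature frame by frame (assumed momentum value)."""
    return F.normalize(momentum * old + (1.0 - momentum) * new, dim=1)
```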
The second most common feature used for tracking is related to the object’s motion. This is especially useful when full occlusion occurs: the object’s state can change significantly, and the appearance features are no longer reliable for matching, whereas a robust motion model can predict the object’s location even if it disappears from the scene. The most common approach for motion tracking is the Kalman filter, as in Refs. [50,51,52]. The authors in Refs. [34,45] use the relative position between two tracks in two adjacent frames to decide whether or not the two tracks belong to the same object. The authors in Refs. [31,32,46] use deep learning approaches for motion tracking, most commonly based on the LSTM network. The Kalman filter has a significantly lower computational cost than the deep learning approaches. The issue with relying only on motion models for tracking is the random motion of objects; for instance, motion models work better on cars, whose motion is constrained, than on people. Zhou et al. introduced CenterTrack in Ref. [46] for tracking objects as points. The system is end-to-end, taking the current and the previous frames and outputting the matched tracks, as illustrated in Figure 3.
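For reference, a minimal constant-velocity Kalman filter over a box centre is sketched below. The 4-D state (x, y, vx, vy) and the noise values are simplifying assumptions; practical trackers usually track the full box state (centre, aspect ratio, height, and their velocities).

```python
import numpy as np

class ConstantVelocityKF:
    """Toy constant-velocity Kalman filter. State: [x, y, vx, vy]; measurement: [x, y]."""
    def __init__(self, x, y, dt=1.0, q=1e-2, r=1.0):
        self.x = np.array([x, y, 0.0, 0.0])
        self.P = np.eye(4) * 10.0                       # initial state uncertainty (assumed)
        self.F = np.array([[1, 0, dt, 0],
                           [0, 1, 0, dt],
                           [0, 0, 1,  0],
                           [0, 0, 0,  1]], float)       # constant-velocity transition
        self.H = np.array([[1, 0, 0, 0],
                           [0, 1, 0, 0]], float)        # we only observe the position
        self.Q = np.eye(4) * q                          # process noise (assumed)
        self.R = np.eye(2) * r                          # measurement noise (assumed)

    def predict(self):
        """Propagate the state; usable even while the object is occluded."""
        self.x = self.F @ self.x
        self.P = self.F @ self.P @ self.F.T + self.Q
        return self.x[:2]

    def update(self, zx, zy):
        """Correct the prediction with a newly detected centre."""
        z = np.array([zx, zy])
        y = z - self.H @ self.x
        S = self.H @ self.P @ self.H.T + self.R
        K = self.P @ self.H.T @ np.linalg.inv(S)
        self.x = self.x + K @ y
        self.P = (np.eye(4) - K @ self.H) @ self.P
```

During occlusion, only predict() is called each frame, so the track keeps moving along its last estimated velocity until it is re-detected.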
Other features are also used for tracking. The authors in Refs. [56,61] added the IoU metric between the adjacent frames’ detections to match two tracks. The tracking of one object in the scene can be affected by other objects; for this reason, some methods have introduced an interactivity feature into the tracking algorithm. In Ref. [31], the interactivity features were extracted from an occupancy map using an LSTM network. The authors in Ref. [54] used a tracking graph and designed a set of conditions to measure the interactivity between two objects. Figure 4 illustrates the overview of tracking graph methods. The approach in Ref. [63] exploited size and structure as features, along with appearance and motion, for tracking. Increasing the number of features can improve the tracking process at the cost of computation time; the main issue is the processing needed to fuse features with different dimensionalities. The authors in Refs. [44,68] used IoU in addition to appearance and motion features to increase the reliability of the tracking. The authors in Ref. [44] further improved the model by adding epipolar constraints with the IoU and introduced TrackletNet to group similar tracks into a cluster. Ref. [69] added the Deep SORT algorithm on top of the extracted features to reduce unreliable tracks, and Ref. [36] added a correlation filter tracker to the CNN. Ref. [47] performed feature extraction and matching simultaneously by having affinity estimation and multi-dimensional assignment in one network. The authors in Refs. [70,71] experimented with 3D distance estimation from RGB frames; in Ref. [70], a Poisson multi-Bernoulli mixture tracking filter was used to perform the 3D projections. In addition to a CNN, Ref. [72] experimented with the track’s Gaussian Mixture Probability Hypothesis Density. The authors in Ref. [37] introduced a motion segmentation framework using motion cues, in addition to the IoU and bounding box clusters, for object tracking through multiple frames. An interesting technique is established in Ref. [33], where one network takes the current and prior frames as inputs and outputs point tracks; those tracks are given to a displacement model to measure the similarity. Similarly, another approach to motion modeling is introduced in Ref. [45] to handle overlapping tracks by using efficient quadratic pseudo-Boolean optimization.
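The IoU cue mentioned throughout this section reduces to a few lines; the sketch below assumes corner-format boxes (x1, y1, x2, y2).

```python
def iou(box_a, box_b):
    """Intersection over Union of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0
```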
The features extracted from every detection in the current frame must be associated with those extracted from the previous frames. The most popular approach to data association in recent years has been the Hungarian algorithm, as in Refs. [50,53,60,63,67]; its advantage is accuracy combined with a fast computation time. Zhang et al. proposed ByteTrack in Ref. [50], where the Kalman filter is used for predicting the detection location, followed by two levels of association. The first level utilizes the appearance features in addition to the Intersection over Union (IoU) for matching tracks using the Hungarian algorithm; the second level deals with the weak detections by utilizing only the IoU with the unmatched tracks remaining from the first level. The authors in Refs. [31,43,52] use deep learning networks for data association. The authors in Ref. [39] introduced a model trained using reinforcement learning, and the metric learning concept was used to train the matching model in Ref. [32]. The authors in Ref. [62] take advantage of the object detection network for feature extraction, which saves the computational cost of applying a separate appearance feature network to each detection in the current frame. The approaches in Refs. [43,52] apply the same concept, in addition to an end-to-end system that takes the current and previous frames as input and outputs the current frame with tracks. The hierarchical single-branch network is an example of an end-to-end system, proposed in Ref. [52] and illustrated in Figure 5. The P3AFormer tracker introduced in Ref. [57] uses the simultaneous detection and tracking framework, where a decoder and a detector extract pixel-level features from the current and previous frames. The features are passed into a multilayer perceptron (MLP) that outputs the size, center, and class information, and they are then matched using the Hungarian algorithm. The system overview of P3AFormer is shown in Figure 6.
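The sketch below illustrates the two ideas described above: a combined appearance-plus-IoU cost solved with the Hungarian algorithm (scipy’s linear_sum_assignment), followed by an IoU-only pass over the leftovers. The weights, gates, and the embed_dist / box attributes are placeholders chosen for illustration and do not come from any of the cited trackers; the second pass is greedy here only for brevity.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def associate(tracks, detections, embed_dist, iou, w_app=0.7, gate=0.8):
    """First-level association: combined appearance + IoU cost matrix solved with
    the Hungarian algorithm. Returns matched (track, detection) index pairs and
    the indices of unmatched tracks and detections."""
    if not tracks or not detections:
        return [], list(range(len(tracks))), list(range(len(detections)))
    cost = np.zeros((len(tracks), len(detections)))
    for i, t in enumerate(tracks):
        for j, d in enumerate(detections):
            cost[i, j] = (w_app * embed_dist(t, d)
                          + (1.0 - w_app) * (1.0 - iou(t.box, d.box)))
    rows, cols = linear_sum_assignment(cost)
    matches = [(r, c) for r, c in zip(rows, cols) if cost[r, c] < gate]
    matched_t = {r for r, _ in matches}
    matched_d = {c for _, c in matches}
    unmatched_t = [i for i in range(len(tracks)) if i not in matched_t]
    unmatched_d = [j for j in range(len(detections)) if j not in matched_d]
    return matches, unmatched_t, unmatched_d

def associate_leftovers(tracks, detections, unmatched_t, unmatched_d, iou, iou_gate=0.5):
    """Second-level pass over the leftovers using IoU only (greedy for brevity)."""
    remaining = list(unmatched_d)
    extra = []
    for i in unmatched_t:
        scores = [(iou(tracks[i].box, detections[j].box), j) for j in remaining]
        if scores:
            best, j = max(scores)
            if best >= iou_gate:
                extra.append((i, j))
                remaining.remove(j)
    return extra
```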
The Recurrent Autoregressive Networks (RAN) approach introduced in Ref. [38] defines an autoregressive model that estimates the mean and variance of the appearance and motion features of all associated tracks of the same object stored in external memory. The internal memory uses the generated model to compare against all upcoming detections; the detection with the maximum score above a certain threshold is then considered the same object, and the external and internal memory are updated with the newly associated detection. There are two types of independence in this approach: first, the motion and appearance models have different parameters and therefore separate internal and external memories; second, a new RAN model is generated for every new object detected. Lost tracks are terminated after 20 frames. The visual features are extracted using the fully connected layer (fc8) of the Inception network, and the motion feature is a 4-dimensional vector representing the width, height, and relative position from the previous detection. The CTracker framework [43] takes two adjacent frames as inputs and matches each detection with the other using Intersection over Union calculations. A constant velocity motion model is used to match the tracks for up to a certain number of frames to handle the reappearance of lost tracks. This approach is end-to-end: it takes two frames as input and outputs two frames with detections and matched tracks.
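The mean-and-variance idea can be made concrete with a toy running-statistics template: each track keeps an estimated mean and variance of its appearance features, scores new detections by how well they fit that estimate, and accepts the best-scoring one above a threshold. This is only a schematic reading of the description above, not the RAN architecture itself, which learns the autoregressive weights.

```python
import numpy as np

class RunningAppearanceModel:
    """Toy stand-in for an autoregressive appearance model: keeps a running
    mean/variance of a track's feature vectors and scores new detections."""
    def __init__(self, feature, alpha=0.1):
        self.mean = feature.astype(float)
        self.var = np.ones_like(self.mean)
        self.alpha = alpha   # fixed update rate, standing in for learned weights

    def score(self, feature):
        # higher is better: negative variance-weighted squared distance
        return -np.sum((feature - self.mean) ** 2 / (self.var + 1e-6))

    def update(self, feature):
        diff = feature - self.mean
        self.mean += self.alpha * diff
        self.var = (1.0 - self.alpha) * self.var + self.alpha * diff ** 2

def best_match(model, features, threshold):
    """Pick the detection with the maximum score above the threshold, if any."""
    scores = [model.score(f) for f in features]
    k = int(np.argmax(scores))
    return k if scores[k] > threshold else None
```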
The authors in Ref. [33] introduced an approach in which dissimilarity measures between detections are computed and then matched. The first step is the dissimilarity cost computation. The histogram of the H and S channels of the HSV colorspace of the previous detections is compared to the corresponding histogram of the current detections; a grid structure is used, as in Refs. [73,74], so multiple histograms are used to match the appearance features. Furthermore, the Linear Binary Pattern Histogram (LBPH), introduced in Ref. [75] and used for object recognition in Ref. [76], is utilized for computing the structure-based distance. Matching the predicted and measured positions using the L2 norm is added as the motion-based distance, and, finally, IoU captures the size difference between the current detections and the previous tracks. The second step uses the Hungarian algorithm [77] to solve the assignment based on the overall similarity computed from the four features calculated in the previous step.
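A hedged sketch of the color-histogram part of this cost is given below, using OpenCV: H and S histograms are computed on a detection crop and compared with the correlation measure. The grid subdivision and the other three distance terms described above are omitted, and the bin counts are arbitrary.

```python
import cv2

def hs_histogram(bgr_patch, h_bins=30, s_bins=32):
    """Normalized 2-D histogram over the H and S channels of a BGR crop."""
    hsv = cv2.cvtColor(bgr_patch, cv2.COLOR_BGR2HSV)
    hist = cv2.calcHist([hsv], [0, 1], None, [h_bins, s_bins], [0, 180, 0, 256])
    cv2.normalize(hist, hist, alpha=0, beta=1, norm_type=cv2.NORM_MINMAX)
    return hist

def appearance_dissimilarity(patch_prev, patch_curr):
    """1 - histogram correlation, so that 0 means identical appearance."""
    h1 = hs_histogram(patch_prev)
    h2 = hs_histogram(patch_curr)
    return 1.0 - cv2.compareHist(h1, h2, cv2.HISTCMP_CORREL)
```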
The authors in Ref. [68] proposed V-IoU, an extension of IoU tracking [78]. The objective is to reduce the number of ID switches and fragmentations by maintaining the location of lost tracks for a certain number of frames until they reappear. A backtracking technique, where the reappeared track is projected backward through the frames, is implemented to validate that the reappeared track is, in fact, the lost one. In Ref. [46], CenterNet [79] is used to detect objects as points. CenterTrack takes two adjacent frames as inputs, in addition to the point detections in the previous frame, and the tracks are associated using the offset between the current and previous point detections. The authors in Ref. [80] designed a motion segmentation model where the point clusters used for trajectory prediction were placed around the center of the detected bounding box. The approach in Ref. [37] employs optical flow and correlation co-clustering for projecting trajectory points across multiple frames, as illustrated in Figure 7.
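A rough sketch of offset-based association on point detections follows: each current centre is shifted back by a predicted offset and greedily linked to the closest unclaimed previous centre within a radius. The offset values are treated as given (in CenterTrack they come from the network), and the greedy matching and radius are illustrative simplifications.

```python
import numpy as np

def associate_by_offset(prev_centers, curr_centers, offsets, radius=20.0):
    """Greedy centre matching: shift current centres back by their predicted
    offsets and link each to the nearest unclaimed previous centre."""
    prev_centers = np.asarray(prev_centers, float)                       # (M, 2)
    shifted = np.asarray(curr_centers, float) - np.asarray(offsets, float)  # (N, 2)
    claimed, matches = set(), []
    for j, c in enumerate(shifted):
        if len(prev_centers) == 0:
            break
        dists = np.linalg.norm(prev_centers - c, axis=1)
        for i in np.argsort(dists):
            if int(i) not in claimed and dists[i] <= radius:
                matches.append((int(i), j))
                claimed.add(int(i))
                break
    return matches
```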
There has been recent advancement in multiple object tracking and segmentation (MOTS). This field tackles the issues of classic MOT that are associated with the use of bounding box detection and tracking, such as background noise and loss of shape features. The approach introduced in Ref. [81] used an instance segmentation mask on the extracted embedding. The method in Ref. [82] applies contrastive learning to learn the instance masks used for segmentation. An offline approach introduced in Ref. [83] exploits appearance features for tracking; this method is currently at the top of the leaderboard of the MOTS20 challenge [84].
The system can achieve more reliability when utilizing multiple sensors to detect and track targets, in addition to understanding their intentions [28]. Deep learning approaches are improving and showing promise on LIDAR datasets; the issue is their long running time, which makes real-time deployment difficult [29]. The challenges facing 3D tracking are related to the fusion of the data perceived from LIDAR and RGB cameras. Table 3 lists the recent MOT techniques that utilize LIDAR and camera for tracking.
Simon et al. [85] proposed Complexer-YOLO, illustrated in Figure 8, for detection, tracking, and segmentation on RGB and LIDAR data. A preprocessing step for the point cloud input from the LIDAR generates a voxelized map of the 3D detections, while the RGB frame is passed into ENet [91], which outputs a semantic map. Both maps are matched and passed into the Complexer-YOLO network to output tracks. The approach in Ref. [86] extracts features from the RGB frame and the point cloud data. An end-to-end approach was introduced in Ref. [87] for extracting and fusing features from RGB and point cloud data, as illustrated in Figure 9; point-wise convolution, in addition to a start-and-end estimator, is utilized for fusing both types of data for tracking.
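One recurring sub-problem in camera and LIDAR fusion is bringing both modalities into a common frame. The sketch below projects LIDAR points into the image plane using KITTI-style calibration matrices; the matrix names (P2, R0_rect, Tr_velo_to_cam) follow the KITTI calibration convention and are an assumption about the data format, not part of the fusion networks discussed above.

```python
import numpy as np

def project_lidar_to_image(points_velo, P2, R0_rect, Tr_velo_to_cam):
    """Project Nx3 LIDAR points into pixel coordinates using KITTI-style calibration.
    Returns an Nx2 array of pixel positions and a mask of points in front of the camera."""
    n = points_velo.shape[0]
    pts_h = np.hstack([points_velo, np.ones((n, 1))])   # homogeneous coordinates (N, 4)

    # 3x4 extrinsics -> 4x4, 3x3 rectification -> 4x4
    Tr = np.vstack([Tr_velo_to_cam, [0, 0, 0, 1]])
    R0 = np.eye(4)
    R0[:3, :3] = R0_rect

    cam = R0 @ Tr @ pts_h.T                             # (4, N) points in the camera frame
    in_front = cam[2, :] > 0.1                          # keep points ahead of the camera
    img = P2 @ cam                                      # (3, N) after the 3x4 projection
    pix = (img[:2, :] / img[2, :]).T                    # perspective division -> (N, 2)
    return pix, in_front
```

Once the points are in pixel coordinates, image features and point cloud features for the same detection can be gathered and fused by concatenation or addition, as several of the cited approaches do.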

3. MOT Benchmark Datasets and Evaluation Metrics

In this section, we review the most common datasets that are used for training and testing MOT techniques. We also provide an overview of the metrics used to evaluate the performance of these techniques.

3.1. Benchmark Datasets

Most research done in multiple object tracking uses standard datasets for evaluating state-of-the-art techniques. In this way, it is easier to see on which criteria new methodologies show superiority. For this application, the common moving objects are pedestrians, vehicles, cyclists, etc. The most common datasets that provide a variety of those objects in the streets are the MOTChallenge collection and KITTI.
  • MOTChallenge: The most common datasets in this collection are MOT15 [19], MOT16 [92], MOT17 [92], and MOT20 [93]. MOT20 is the newest set and, to our current knowledge, has not yet become a standard for evaluation in the research community. The MOT datasets contain some data from existing sets, such as PETS and TownCenter, and other data that are unique. Examples of the data included are presented in Table 4, where the amount of variation included in MOT15 and MOT16 can be observed. Thus, the collection is useful for training and testing with static and dynamic backgrounds and for 2D and 3D tracking. An evaluation tool is also provided with the set to measure all aspects of a multiple object tracking algorithm, including accuracy, precision, and FPS. Ground truth data samples are shown in Figure 10 (a short sketch of reading this ground-truth format is given after this list).
  • KITTI [94]: This dataset was created specifically for autonomous driving. It was collected by a car driven through the streets with multiple sensors mounted for data collection. The set includes point cloud data collected using LIDAR sensors and RGB video sequences captured by monocular cameras. It has been included in multiple research works related to 2D and 3D multiple object tracking. Samples of the point cloud and RGB data included in the KITTI dataset are shown in Figure 11.
  • UA-DETRAC [96,97,98]: The dataset includes video sequences captured by static cameras overlooking the streets of different cities. The large number of labeled vehicles can assist in training and testing static-background multiple object tracking for surveillance and autonomous driving. Samples of the UA-DETRAC dataset under different illumination conditions are shown in Figure 12.
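As referenced in the MOTChallenge item above, a small sketch for reading a MOTChallenge-style ground-truth file is given below. It assumes the usual comma-separated layout (frame, id, left, top, width, height, confidence, class, visibility) and that entries with a zero confidence flag are ignored; both are assumptions about the benchmark format rather than part of any tracker.

```python
import csv
from collections import defaultdict

def load_mot_ground_truth(gt_path):
    """Read a MOTChallenge-style gt.txt and return {frame: [(track_id, box), ...]},
    where box = (left, top, width, height)."""
    per_frame = defaultdict(list)
    with open(gt_path, newline="") as f:
        for row in csv.reader(f):
            frame, track_id = int(row[0]), int(row[1])
            left, top, w, h = map(float, row[2:6])
            conf = float(row[6])
            if conf > 0:  # entries flagged 0 are typically excluded from evaluation
                per_frame[frame].append((track_id, (left, top, w, h)))
    return per_frame
```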

3.2. Evaluation Metrics

The most common evaluation metrics are the CLEAR MOT metrics, which were developed by Refs. [22,23]. Mostly tracked objects (MT) and mostly lost objects (ML), in addition to IDF1, are used to present the leaderboards in the MOTChallenges. False positives (FP) is the number of falsely detected objects, and false negatives (FN) is the number of falsely undetected objects. Fragmentation (Fragm) is the number of times a track gets interrupted, and ID switches (IDSW) is the number of times an ID changes. Multiple Object Tracking Accuracy (MOTA) is given by (1), whereas Multiple Object Tracking Precision (MOTP) is given by (2). Finally, frames per second (Hz) is reported, and IDF1 is given by (3).
$$\mathrm{MOTA} = 1 - \frac{\mathrm{FN} + \mathrm{FP} + \mathrm{IDSW}}{\mathrm{GT}} \quad (1)$$
where GT is the total number of ground truth labels.
$$\mathrm{MOTP} = \frac{\sum_{t,i} d_{t,i}}{\sum_t c_t} \quad (2)$$
where $c_t$ is the number of matches in frame $t$ and $d_{t,i}$ is the overlap distance between correctly matched detection $i$ and its track in frame $t$.
$$\mathrm{IDF1} = \frac{2}{\frac{1}{\mathrm{IDP}} + \frac{1}{\mathrm{IDR}}} \quad (3)$$
where identification precision is defined by
$$\mathrm{IDP} = \frac{\mathrm{IDTP}}{\mathrm{IDTP} + \mathrm{IDFP}} \quad (4)$$
and identification recall is defined by
$$\mathrm{IDR} = \frac{\mathrm{IDTP}}{\mathrm{IDTP} + \mathrm{IDFN}} \quad (5)$$
IDTP is the sum of true positive edge weights, IDFP is the sum of false positive edge weights, and IDFN is the sum of false negative edge weights.
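Given aggregated counts, the CLEAR and identity metrics above reduce to a few lines. The sketch below assumes the counts (FN, FP, IDSW, GT, IDTP, IDFP, IDFN) and the summed match distances have already been accumulated by a matching procedure.

```python
def mota(fn, fp, idsw, gt):
    """Multiple Object Tracking Accuracy, Equation (1)."""
    return 1.0 - (fn + fp + idsw) / gt

def motp(total_distance, total_matches):
    """Multiple Object Tracking Precision, Equation (2): mean distance over all matches."""
    return total_distance / total_matches

def idf1(idtp, idfp, idfn):
    """IDF1, Equation (3), as the harmonic mean of IDP (4) and IDR (5)."""
    idp = idtp / (idtp + idfp)
    idr = idtp / (idtp + idfn)
    return 2.0 / (1.0 / idp + 1.0 / idr)
```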
The metrics used for evaluation on the UA-DETRAC dataset utilize the precision-recall (PR) curve for calculating the CLEAR metrics, as introduced in Ref. [96]. In addition to these metrics, the HOTA metric introduced in Ref. [99] is calculated by the formula in (6).
$$\mathrm{HOTA} = \int_{0 < \alpha \le 1} \mathrm{HOTA}_\alpha \, d\alpha \quad (6)$$
where
$$\mathrm{HOTA}_\alpha = \sqrt{\frac{\sum_{c \in \mathrm{TP}_\alpha} \text{Ass-IoU}_\alpha(c)}{|\mathrm{TP}_\alpha| + |\mathrm{FN}_\alpha| + |\mathrm{FP}_\alpha|}} \quad (7)$$
where
$$\text{Ass-IoU} = \frac{|\mathrm{TPA}|}{|\mathrm{TPA}| + |\mathrm{FNA}| + |\mathrm{FPA}|} \quad (8)$$
where TPA, FNA, and FPA denote the true positive, false negative, and false positive associations, respectively.
The MOTS20 Challenge [84] proposed metrics for evaluating multiple object tracking and segmentation (MOTS) methods, which replace bounding boxes with pixel-level masks. The multi-object tracking and segmentation accuracy (MOTSA) is calculated using the formula in (9). Similarly, the multi-object tracking and segmentation precision (MOTSP) and the soft multi-object tracking and segmentation accuracy (sMOTSA) are given by (10) and (11), respectively.
$$\mathrm{MOTSA} = 1 - \frac{|\mathrm{FN}| + |\mathrm{FP}| + |\mathrm{IDSW}|}{|M|} \quad (9)$$
where M is the set of ground truth masks.
$$\mathrm{MOTSP} = \frac{\widetilde{\mathrm{TP}}}{|\mathrm{TP}|} \quad (10)$$
where $\widetilde{\mathrm{TP}}$ is the soft version of the true positives (TP).
$$\mathrm{sMOTSA} = \frac{\widetilde{\mathrm{TP}} - |\mathrm{FP}| - |\mathrm{IDSW}|}{|M|} \quad (11)$$

4. Evaluation and Discussion

In this section, we compare the MOT techniques based on the dataset used for evaluation. Then, analysis and discussion are conducted to provide insight for future work.
The performance of the most recent MOT techniques on the MOT15, 16, 17, and 20 datasets is shown in Table 5. The ↑ stands for higher is better, and ↓ stands for lower is better. The protocol indicates the type of detector used for evaluating the results: the public detector is provided by the dataset and is common to all methods, whereas a private detector is designed by the method and is not shared. On MOT15, the tracker introduced in Ref. [38] has the highest accuracy and the lowest number of identity switches (IDs); it also maintained the highest percentage of mostly tracked objects (MT) and the lowest percentage of mostly lost ones (ML). On the other hand, it produced significantly more false positives and false negatives than the method introduced in Ref. [42], which also performed with the highest FPS (Hz). The authors in Ref. [36] evaluated their system using the private detector protocol and achieved significantly lower fragmentation than all other methods. In Ref. [38], the tracker relied on appearance features extracted from an Inception network layer and on the position of the detections, with the association performed using conditional probability. The approach in Ref. [48] has the second-best accuracy; it used motion and appearance features as well as a category classifier for the association. The method with the lowest accuracy [80] relied only on motion features, while the slowest-performing method [39] applied reinforcement learning for data association. The approach in Ref. [62] has the highest accuracy on the MOT16 dataset using the private detector protocol, and the method in Ref. [53], which also used a private detector, has the highest HOTA. Both methods relied on appearance features for tracking and incorporated the Hungarian algorithm for matching. Similarly, a significantly faster method [38] with slightly lower accuracy used only the appearance features and a prediction network for the association. The method in Ref. [35] used the Kalman filter for motion prediction and the appearance features extracted from the detection network for tracking; it has a lower accuracy in comparison to the other methods. On MOT17, the approach with the highest accuracy [57] has a significantly higher fragmentation than the one introduced in Ref. [64], which only used the motion feature for tracking. The method in Ref. [67] has slightly lower accuracy but an acceptable FPS; it used the appearance and motion features in addition to the Hungarian algorithm for matching. The approaches in Refs. [56,58] have acceptable accuracy; both used a Kalman filter for motion tracking, and Ref. [56] neglected the appearance features. The method in Ref. [64] has the lowest fragmentation and only uses motion features for tracking. The method in Ref. [57] has the highest accuracy on MOT20, although it has significantly higher fragmentation than the one in Ref. [54]. The approach in Ref. [67] has the highest FPS, followed by the one in Ref. [54]; all of these methods relied on visual and motion features for tracking. The methods in Refs. [58,67] used only motion features and achieved acceptable accuracy. On the other hand, the methods that relied only on visual features, such as Refs. [60,62,63,65], did not perform well according to accuracy and the other metrics. This evaluation shows that appearance features are essential for high accuracy, while other cues are used to boost performance.
Based on the results in Table 5 and the summary presented in Table 2, the utilization of deep learning for data association reduces the processing time, as can be observed from Refs. [43,49]. On the other hand, including motion cues in the system drops the FPS significantly compared to using only the visual features, as indicated by the results in Ref. [32]. Although adding complexity to the system reduces the FPS, the IDS metric improves significantly when motion features are included. It might be concluded from these findings that, to improve both the FPS and the accuracy, deep learning should be used in all MOT components. Nevertheless, the end-to-end approaches introduced in Refs. [32,48], which used deep learning to extract appearance and motion features and for data association, did not compete with the other approaches. Deep learning approaches are data-driven, which means they are suitable for specific tasks but can be expected to perform poorly in real scenarios that deviate from the data used in training [13].
Similarly, the performance on the KITTI dataset is shown in Table 6. The dataset is divided into multiple sequences: car, pedestrian, and cyclist, and the methods are evaluated either on all of them combined or individually. As can be observed, the pedestrian data are the most difficult to process with good performance. Only the methods evaluated on all sequences are used for comparison, and accordingly, the values corresponding to the best performance according to each metric are in bold; the rest are listed for observation and analysis. The approach introduced in Ref. [46] showed superiority: although it did not perform well on the MOT15 dataset, it showed a competitive performance on the MOT17 dataset. The authors in Ref. [102] evaluated the proposed technique on each sequence individually and achieved superior performance on all of them. The performance on the UA-DETRAC dataset is shown in Table 7. Although the UA-DETRAC dataset has a static background and does not include the challenge of a dynamic background, the approach introduced in Ref. [47] performed better on the KITTI dataset. This variation in the performance of MOT techniques across multiple datasets may indicate that the techniques are data-driven and difficult to generalize. Similar to 2D tracking, deep learning is utilized for visual feature extraction in 3D; for processing point cloud data, PointNet is the most popular method. The authors in Refs. [85,89] did not rely on deep learning techniques for processing point cloud data and performed poorly in terms of accuracy on the KITTI dataset, as shown in Table 6. The approach with the highest accuracy in Table 3, introduced in Ref. [86], explores multiple solutions to the data fusion problem: the features extracted from the LIDAR and camera sensors are fused by concatenation, addition, or weighted addition and then passed into a custom-designed network that calculates the correlation between the features and outputs the linked detections. The approaches that depend on deep learning for data fusion, Refs. [85,87,90], have a low MOTA, although Ref. [90] has high accuracy on the car set. The localization-based tracking introduced in Ref. [103] has the best accuracy on the UA-DETRAC dataset. Although the DMM-Net method in Ref. [104] has significantly lower accuracy, it showed superiority in identity switches and fragmentation.
Navigation and self-driving applications in robotics depend on online operation: the system must be able to react in real time. Although most of the techniques discussed can process video sequences online, their frame rates, as indicated by the Hz metric, are not robust enough for deployment in an application such as self-driving. The method introduced in Ref. [43] has the highest FPS overall and acceptable accuracy on the MOT16 and MOT17 datasets. However, the research utilizing LIDAR and RGB cameras shows potential for robotics navigation and autonomous driving applications.

5. Current Research Challenges

Through this study, we gained insight into the current trends of online MOT methods that can be utilized in robotics applications and the challenges they face. The first challenge is the online requirement: for most robotics applications, the MOT algorithm should operate in real time in order to react to environmental changes. The second challenge is maintaining accurate track trajectories across multiple frames; failing to do so causes multiple identity switches and makes it difficult to keep a concise description of the surroundings. A further issue is segmenting the detections at the pixel level and tracking them, since a bounding box provides imprecise information about an object’s shape and size and includes noise from the background. Finally, the motion feature has proven its value in tracking, but it is not simple to track randomly moving objects such as people, animals, cyclists, etc.

6. Future Work

The final objective of the research on deploying MOT algorithms in autonomous robots is a reliable system that contributes to reducing accidents and facilitating tasks that might be difficult for humans to carry out. One aspect that we found the current research lacks is the generation of a new benchmark dataset that includes data collected by the standard sensors employed by the current industry. Sensors such as ultrasonic and LiDAR are essential in today’s autonomous robot manufacturing, and it is necessary to use the same tools to keep the research on MOT up to date. Moreover, using deep learning algorithms for detection and tracking faces a massive problem due to the risk of encountering a variation that was not included in the training set. Thus, deep learning models could be trained to segment road regions so that they can be adapted to any area before deployment; this is one of the approaches that can be researched to tackle the problem of dealing with new objects. The current research regards appearance and motion models as necessary for MOT and is moving further into learning the behavior of objects in the scene and the interactivity between those objects. For instance, two objects moving towards each other will lead to one of them becoming covered and a track being lost. As the MOT system’s complexity increases, it becomes more challenging to operate in real time; research on embedded processors that can be utilized in autonomous robots will contribute significantly to increasing the accuracy while maintaining the online capability.

7. Conclusions

This paper reviews the current trends and challenges related to MOT for autonomous robotics applications. This area of study has been frequently researched recently due to its high potential and the demanding standards it must meet. The paper has discussed and compared the MOT techniques through a common framework and datasets, including the MOTChallenges, KITTI, and UA-DETRAC. There is a vast area left to explore and investigate, as well as multiple approaches created in the literature that have the potential to develop into reliable and robust techniques. A summary of the components utilized in the general MOT framework, including appearance and motion cues, data association, and occlusion handling, has been listed and studied. In addition, the popular methods used for data fusion between multiple sensors, focusing on the camera and LIDAR, have been reviewed. The role that deep learning techniques play in MOT approaches has been investigated thoroughly using quantitative analysis to evaluate their limitations and strong points.

Author Contributions

Conceptualization, M.G., T.B. and A.G.; methodology, M.G., T.B. and A.G.; software, M.G., T.B. and A.G.; validation, M.Y., H.A. and M.A.; formal analysis, M.G., T.B. and A.G.; investigation, M.G., T.B. and A.G.; resources, M.G. and M.A.; data curation, M.G., T.B. and A.G.; writing—original draft preparation, T.B., A.G. and H.A.; writing—review and editing, M.G., T.B., A.G., M.Y., H.A. and M.A.; visualization, M.G., T.B., A.G. and M.A.; supervision, M.G. and M.A.; project administration, M.G. and M.A.; funding acquisition, M.A. All authors have read and agreed to the published version of the manuscript.

Funding

This paper is part of a research project funded by the Office of Research and Sponsored Programs (ORSP) at Abu Dhabi University through a research fund (Grant number 19300564).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Not applicable.

Conflicts of Interest

The authors declare no conflict of interest.

Abbreviations

The following abbreviations are used in this manuscript:
MOT	Multiple Object Tracking
LIDAR	Light Detection and Ranging
KITTI	Karlsruhe Institute of Technology and Toyota Technological Institute
UA-DETRAC	University at Albany DEtection and TRACking
SLAM	Simultaneous localization and mapping
ROI	Region of Interest
IoU	Intersection over Union
HOTA	Higher Order Tracking Accuracy
SLAMMOT	Simultaneous localization and mapping Multiple Object Tracking
RTU	Recurrent Tracking Unit
LSTM	Long Short Term Memory
ECO	Efficient Convolution Operators
PCA	Principal Component Analysis
LDAE	Lightweight and Deep Appearance Embedding
DLA	Deep Layer Aggregation
GCD	Global Context Disentangling
GTE	Guided Transformer Encoder
DETR	DEtection TRansformer
PCB	Part-based Convolutional Baseline

References

  1. Runz, M.; Buffier, M.; Agapito, L. MaskFusion: Real-Time Recognition, Tracking and Reconstruction of Multiple Moving Objects. In Proceedings of the 2018 IEEE International Symposium on Mixed and Augmented Reality, ISMAR 2018, Munich, Germany, 16–20 October 2019; pp. 10–20. [Google Scholar]
  2. Zhou, Q.; Zhao, S.; Li, H.; Lu, R.; Wu, C. Adaptive Neural Network Tracking Control for Robotic Manipulators with Dead Zone. IEEE Trans. Neural Netw. Learn. Syst. 2019, 30, 3611–3620. [Google Scholar] [CrossRef] [PubMed]
  3. Yu, C.; Liu, Z.; Liu, X.; Xie, F.; Yang, Y.; Wei, Q.; Fei, Q. DS-SLAM: A Semantic Visual SLAM towards Dynamic Environments. In Proceedings of the IEEE International Conference on Intelligent Robots and Systems, Madrid, Spain, 1–5 October 2018; pp. 1168–1174. [Google Scholar]
  4. Fehr, L.; Langbein, W.E.; Skaar, S.B. Adequacy of power wheelchair control interfaces for persons with severe disabilities: A clinical survey. J. Rehabil. Res. Dev. 2000, 37, 353–360. [Google Scholar]
  5. Simpson, R. Smart wheelchairs: A literature review. J. Rehabil. Res. Dev. 2005, 42, 423–436. [Google Scholar] [CrossRef]
  6. Martins, M.M.; Santos, C.P.; Frizera-Neto, A.; Ceres, R. Assistive mobility devices focusing on Smart Walkers: Classification and review. Robot. Auton. Syst. 2012, 60, 548–562. [Google Scholar] [CrossRef]
  7. Khan, M.Q.; Lee, S. A comprehensive survey of driving monitoring and assistance systems. Sensors 2019, 19, 2574. [Google Scholar] [CrossRef] [PubMed]
  8. Van Brummelen, J.; O’Brien, M.; Gruyer, D.; Najjaran, H. Autonomous vehicle perception: The technology of today and tomorrow. Transp. Res. Part C Emerg. Technol. 2018, 89, 384–406. [Google Scholar] [CrossRef]
  9. Pham, H.X.; La, H.M.; Feil-Seifer, D.; Deans, M.C. A distributed control framework of multiple unmanned aerial vehicles for dynamic wildfire tracking. IEEE Trans. Syst. Man Cybern. Syst. 2020, 50, 1537–1548. [Google Scholar] [CrossRef]
  10. Cai, Z.; Wang, L.; Zhao, J.; Wu, K.; Wang, Y. Virtual target guidance-based distributed model predictive control for formation control of multiple UAVs. Chin. J. Aeronaut. 2020, 33, 1037–1056. [Google Scholar] [CrossRef]
  11. Huang, Y.; Liu, W.; Li, B.; Yang, Y.; Xiao, B. Finite-time formation tracking control with collision avoidance for quadrotor UAVs. J. Frankl. Inst. 2020, 357, 4034–4058. [Google Scholar] [CrossRef]
  12. Luo, W.; Xing, J.; Milan, A.; Zhang, X.; Liu, W.; Kim, T. Multiple object tracking: A literature review. Artif. Intell. 2021, 293, 103448. [Google Scholar] [CrossRef]
  13. Ciaparrone, G.; Luque Sánchez, F.; Tabik, S.; Troiano, L.; Tagliaferri, R.; Herrera, F. Deep learning in video multi-object tracking: A survey. Neurocomputing 2020, 381, 61–88. [Google Scholar] [CrossRef]
  14. Xu, Y.; Zhou, X.; Chen, S.; Li, F. Deep learning for multiple object tracking: A survey. IET Comput. Vis. 2019, 13, 411–419. [Google Scholar] [CrossRef]
  15. Irvine, J.M.; Wood, R.J.; Reed, D.; Lepanto, J. Video image quality analysis for enhancing tracker performance. In Proceedings of the 2013 IEEE Applied Imagery Pattern Recognition Workshop (AIPR), Washington, DC, USA, 23–25 October 2013; pp. 1–9. [Google Scholar] [CrossRef]
  16. Meng, L.; Yang, X. A Survey of Object Tracking Algorithms. Zidonghua Xuebao/Acta Autom. Sin. 2019, 45, 1244–1260. [Google Scholar]
  17. Jain, V.; Wu, Q.; Grover, S.; Sidana, K.; Chaudhary, D.G.; Myint, S.; Hua, Q. Generating Bird’s Eye View from Egocentric RGB Videos. Wirel. Commun. Mob. Comput. 2021, 2021, 7479473. [Google Scholar] [CrossRef]
  18. Yeong, D.J.; Velasco-Hernandez, G.; Barry, J.; Walsh, J. Sensor and Sensor Fusion Technology in Autonomous Vehicles: A Review. Sensors 2021, 21, 2140. [Google Scholar] [CrossRef]
  19. Leal-Taixé, L.; Milan, A.; Reid, I.; Roth, S.; Schindler, K. MOTChallenge 2015: Towards a Benchmark for Multi-Target Tracking. arXiv 2015, arXiv:1504.01942. [Google Scholar]
  20. Xu, Z.; Rong, Z.; Wu, Y. A survey: Which features are required for dynamic visual simultaneous localization and mapping? Vis. Comput. Ind. Biomed. Art 2021, 4, 20. [Google Scholar] [CrossRef]
  21. Grigorescu, S.; Trasnea, B.; Cocias, T.; Macesanu, G. A survey of deep learning techniques for autonomous driving. J. Field Robot. 2020, 37, 362–386. [Google Scholar] [CrossRef]
  22. Janai, J.; Güney, F.; Behl, A.; Geiger, A. Computer vision for autonomous vehicles. Found. Trends Comput. Graph. Vis. 2020, 12, 1–308. [Google Scholar] [CrossRef]
  23. Datondji, S.R.E.; Dupuis, Y.; Subirats, P.; Vasseur, P. A Survey of Vision-Based Traffic Monitoring of Road Intersections. IEEE Trans. Intell. Transp. Syst. 2016, 17, 2681–2698. [Google Scholar] [CrossRef]
  24. Buch, N.; Velastin, S.A.; Orwell, J. A review of computer vision techniques for the analysis of urban traffic. IEEE Trans. Intell. Transp. Syst. 2011, 12, 920–939. [Google Scholar] [CrossRef]
  25. Otto, A.; Agatz, N.; Campbell, J.; Golden, B.; Pesch, E. Optimization approaches for civil applications of unmanned aerial vehicles (UAVs) or aerial drones: A survey. Networks 2018, 72, 411–458. [Google Scholar] [CrossRef]
  26. Shakhatreh, H.; Sawalmeh, A.H.; Al-Fuqaha, A.; Dou, Z.; Almaita, E.; Khalil, I.; Othman, N.S.; Khreishah, A.; Guizani, M. Unmanned Aerial Vehicles (UAVs): A Survey on Civil Applications and Key Research Challenges. IEEE Access 2019, 7, 48572–48634. [Google Scholar] [CrossRef]
  27. Badue, C.; Guidolini, R.; Carneiro, R.V.; Azevedo, P.; Cardoso, V.B.; Forechi, A.; Jesus, L.; Berriel, R.; Paixão, T.M.; Mutz, F.; et al. Self-driving cars: A survey. Expert Syst. Appl. 2021, 165, 113816. [Google Scholar] [CrossRef]
  28. Wang, Z.; Wu, Y.; Niu, Q. Multi-Sensor Fusion in Automated Driving: A Survey. IEEE Access 2020, 8, 2847–2868. [Google Scholar] [CrossRef]
  29. Elhousni, M.; Huang, X. A Survey on 3D LiDAR Localization for Autonomous Vehicles. In Proceedings of the IEEE Intelligent Vehicles Symposium, Las Vegas, NV, USA, 9 October–13 November 2020; pp. 1879–1884. [Google Scholar]
  30. Fritsch, J.; Kühnl, T.; Geiger, A. A new performance measure and evaluation benchmark for road detection algorithms. In Proceedings of the 16th International IEEE Conference on Intelligent Transportation Systems (ITSC 2013), The Hague, The Netherlands, 6–9 October 2013; pp. 1693–1700. [Google Scholar] [CrossRef]
  31. Sadeghian, A.; Alahi, A.; Savarese, S. Tracking the Untrackable: Learning to Track Multiple Cues with Long-Term Dependencies. In Proceedings of the 2017 IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; pp. 300–311. [Google Scholar] [CrossRef]
  32. Xiang, J.; Zhang, G.; Hou, J. Online Multi-Object Tracking Based on Feature Representation and Bayesian Filtering Within a Deep Learning Architecture. IEEE Access 2019, 7, 27923–27935. [Google Scholar] [CrossRef]
  33. Karunasekera, H.; Wang, H.; Zhang, H. Multiple Object Tracking With Attention to Appearance, Structure, Motion and Size. IEEE Access 2019, 7, 104423–104434. [Google Scholar] [CrossRef]
  34. Mahmoudi, N.; Ahadi, S.M.; Mohammad, R. Multi-target tracking using CNN-based features: CNNMTT. Multimed. Tools Appl. 2019, 78, 7077–7096. [Google Scholar] [CrossRef]
  35. Zhou, Q.; Zhong, B.; Zhang, Y.; Li, J.; Fu, Y. Deep Alignment Network Based Multi-Person Tracking With Occlusion and Motion Reasoning. IEEE Trans. Multimed. 2019, 21, 1183–1194. [Google Scholar] [CrossRef]
  36. Zhao, D.; Fu, H.; Xiao, L.; Wu, T.; Dai, B. Multi-Object Tracking with Correlation Filter for Autonomous Vehicle. Sensors 2018, 18, 2004. [Google Scholar] [CrossRef]
  37. Keuper, M.; Tang, S.; Andres, B.; Brox, T.; Schiele, B. Motion Segmentation & Multiple Object Tracking by Correlation Co-Clustering. IEEE Trans. Pattern Anal. Mach. Intell. 2020, 42, 140–153. [Google Scholar] [CrossRef]
  38. Fang, K.; Xiang, Y.; Li, X.; Savarese, S. Recurrent Autoregressive Networks for Online Multi-Object Tracking. In Proceedings of the 2018 IEEE Winter Conference on Applications of Computer Vision (WACV), Lake Tahoe, NV, USA, 12–15 March 2018. [Google Scholar] [CrossRef]
  39. Chu, P.; Fan, H.; Tan, C.C.; Ling, H. Online Multi-Object Tracking with Instance-Aware Tracker and Dynamic Model Refreshment. In Proceedings of the 2019 IEEE Winter Conference on Applications of Computer Vision (WACV), Waikoloa, HI, USA, 7–11 January 2019. [Google Scholar] [CrossRef]
  40. Zhu, J.; Yang, H.; Liu, N.; Kim, M.; Zhang, W.; Yang, M.H. Online Multi-Object Tracking with Dual Matching Attention Networks. arXiv 2019, arXiv:1902.00749. [Google Scholar]
  41. Zhou, Z.; Xing, J.; Zhang, M.; Hu, W. Online Multi-Target Tracking with Tensor-Based High-Order Graph Matching. In Proceedings of the 2018 24th International Conference on Pattern Recognition (ICPR), Beijing, China, 20–24 August 2018; pp. 1809–1814. [Google Scholar] [CrossRef]
  42. Sun, S.; Akhtar, N.; Song, H.; Mian, A.; Shah, M. Deep Affinity Network for Multiple Object Tracking. arXiv 2018, arXiv:1810.11780. [Google Scholar] [CrossRef]
  43. Peng, J.; Wang, C.; Wan, F.; Wu, Y.; Wang, Y.; Tai, Y.; Wang, C.; Li, J.; Huang, F.; Fu, Y. Chained-Tracker: Chaining Paired Attentive Regression Results for End-to-End Joint Multiple-Object Detection and Tracking. arXiv 2020, arXiv:2007.14557. [Google Scholar]
  44. Wang, G.; Wang, Y.; Zhang, H.; Gu, R.; Hwang, J.N. Exploit the Connectivity: Multi-Object Tracking with TrackletNet. arXiv 2018, arXiv:1811.07258. [Google Scholar]
  45. Lan, L.; Wang, X.; Zhang, S.; Tao, D.; Gao, W.; Huang, T.S. Interacting Tracklets for Multi-Object Tracking. IEEE Trans. Image Process. 2018, 27, 4585–4597. [Google Scholar] [CrossRef] [PubMed]
  46. Zhou, X.; Koltun, V.; Krähenbühl, P. Tracking Objects as Points. arXiv 2020, arXiv:2004.01177. [Google Scholar]
  47. Chu, P.; Ling, H. FAMNet: Joint Learning of Feature, Affinity and Multi-dimensional Assignment for Online Multiple Object Tracking. arXiv 2019, arXiv:1904.04989. [Google Scholar]
  48. Chen, L.; Ai, H.; Shang, C.; Zhuang, Z.; Bai, B. Online multi-object tracking with convolutional neural networks. In Proceedings of the 2017 IEEE International Conference on Image Processing (ICIP), Beijing, China, 17–20 September 2017; pp. 645–649. [Google Scholar] [CrossRef]
  49. Yoon, K.; Kim, D.Y.; Yoon, Y.C.; Jeon, M. Data Association for Multi-Object Tracking via Deep Neural Networks. Sensors 2019, 19, 559. [Google Scholar] [CrossRef]
  50. Xu, B.; Liang, D.; Li, L.; Quan, R.; Zhang, M. An Effectively Finite-Tailed Updating for Multiple Object Tracking in Crowd Scenes. Appl. Sci. 2022, 12, 1061. [Google Scholar] [CrossRef]
  51. Ye, L.; Li, W.; Zheng, L.; Zeng, Y. Lightweight and Deep Appearance Embedding for Multiple Object Tracking. IET Comput. Vis. 2022, 16, 489–503. [Google Scholar] [CrossRef]
  52. Wang, F.; Luo, L.; Zhu, E.; Wang, S.; Long, J. Multi-object Tracking with a Hierarchical Single-branch Network. CoRR 2021, abs/2101.01984. Available online: http://xxx.lanl.gov/abs/2101.01984 (accessed on 7 September 2022).
  53. Yu, E.; Li, Z.; Han, S.; Wang, H. RelationTrack: Relation-aware Multiple Object Tracking with Decoupled Representation. CoRR 2021, abs/2105.04322. Available online: http://xxx.lanl.gov/abs/2105.04322 (accessed on 7 September 2022).
  54. Wang, S.; Sheng, H.; Yang, D.; Zhang, Y.; Wu, Y.; Wang, S. Extendable Multiple Nodes Recurrent Tracking Framework with RTU++. IEEE Trans. Image Process. 2022, 31, 5257–5271. [Google Scholar] [CrossRef] [PubMed]
  55. Gao, Y.; Gu, X.; Gao, Q.; Hou, R.; Hou, Y. TdmTracker: Multi-Object Tracker Guided by Trajectory Distribution Map. Electronics 2022, 11, 1010. [Google Scholar] [CrossRef]
  56. Nasseri, M.H.; Babaee, M.; Moradi, H.; Hosseini, R. Fast Online and Relational Tracking. arXiv 2022, arXiv:2208.03659. [Google Scholar]
  57. Zhao, Z.; Wu, Z.; Zhuang, Y.; Li, B.; Jia, J. Tracking Objects as Pixel-wise Distributions. arXiv 2022, arXiv:2207.05518. [Google Scholar]
  58. Aharon, N.; Orfaig, R.; Bobrovsky, B.Z. BoT-SORT: Robust Associations Multi-Pedestrian Tracking. arXiv 2022, arXiv:2206.14651. [Google Scholar]
  59. Seidenschwarz, J.; Brasó, G.; Elezi, I.; Leal-Taixé, L. Simple Cues Lead to a Strong Multi-Object Tracker. arXiv 2022, arXiv:2206.04656. [Google Scholar]
  60. Dai, P.; Feng, Y.; Weng, R.; Zhang, C. Joint Spatial-Temporal and Appearance Modeling with Transformer for Multiple Object Tracking. arXiv 2022, arXiv:2205.15495. [Google Scholar]
  61. Zhang, Y.; Wang, C.; Wang, X.; Zeng, W.; Liu, W. Robust Multi-Object Tracking by Marginal Inference. arXiv 2022, arXiv:2208.03727. [Google Scholar]
  62. Hyun, J.; Kang, M.; Wee, D.; Yeung, D.Y. Detection Recovery in Online Multi-Object Tracking with Sparse Graph Tracker. arXiv 2022, arXiv:2205.00968. [Google Scholar]
  63. Chen, M.; Liao, Y.; Liu, S.; Wang, F.; Hwang, J.N. TR-MOT: Multi-Object Tracking by Reference. arXiv 2022, arXiv:2203.16621. [Google Scholar]
  64. Cao, J.; Weng, X.; Khirodkar, R.; Pang, J.; Kitani, K. Observation-Centric SORT: Rethinking SORT for Robust Multi-Object Tracking. arXiv 2022, arXiv:2203.1662. [Google Scholar]
  65. Wan, J.; Zhang, H.; Zhang, J.; Ding, Y.; Yang, Y.; Li, Y.; Li, X. DSRRTracker: Dynamic Search Region Refinement for Attention-based Siamese Multi-Object Tracking. arXiv 2022, arXiv:2203.10729. [Google Scholar]
  66. Du, Y.; Song, Y.; Yang, B.; Zhao, Y. StrongSORT: Make DeepSORT Great Again. arXiv 2022, arXiv:2202.13514. [Google Scholar]
  67. Zhang, Y.; Sun, P.; Jiang, Y.; Yu, D.; Weng, F.; Yuan, Z.; Luo, P.; Liu, W.; Wang, X. ByteTrack: Multi-Object Tracking by Associating Every Detection Box. arXiv 2021, arXiv:2110.06864. [Google Scholar]
  68. Bochinski, E.; Senst, T.; Sikora, T. Extending IOU Based Multi-Object Tracking by Visual Information. In Proceedings of the 2018 15th IEEE International Conference on Advanced Video and Signal Based Surveillance (AVSS), Auckland, New Zealand, 27–30 November 2018; pp. 1–6. [Google Scholar] [CrossRef]
  69. Hou, X.; Wang, Y.; Chau, L.P. Vehicle Tracking Using Deep SORT with Low Confidence Track Filtering. In Proceedings of the 2019 16th IEEE International Conference on Advanced Video and Signal Based Surveillance (AVSS), Taipei, Taiwan, 18–21 September 2019; pp. 1–6. [Google Scholar] [CrossRef]
  70. Scheidegger, S.; Benjaminsson, J.; Rosenberg, E.; Krishnan, A.; Granstrom, K. Mono-Camera 3D Multi-Object Tracking Using Deep Learning Detections and PMBM Filtering. arXiv 2018, arXiv:1802.09975. [Google Scholar]
  71. Hu, H.N.; Cai, Q.Z.; Wang, D.; Lin, J.; Sun, M.; Kraehenbuehl, P.; Darrell, T.; Yu, F. Joint Monocular 3D Vehicle Detection and Tracking. In Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Korea, 27 October–2 November 2019; pp. 5389–5398. [Google Scholar] [CrossRef]
  72. Kutschbach, T.; Bochinski, E.; Eiselein, V.; Sikora, T. Sequential sensor fusion combining probability hypothesis density and kernelized correlation filters for multi-object tracking in video data. In Proceedings of the 2017 14th IEEE International Conference on Advanced Video and Signal Based Surveillance (AVSS), Lecce, Italy, 29 August–1 September 2017; pp. 1–5. [Google Scholar] [CrossRef]
  73. Lowe, D. Distinctive Image Features from Scale-Invariant Keypoints. Int. J. Comput. Vis. 2004, 60, 91. [Google Scholar] [CrossRef]
  74. Dalal, N.; Triggs, B. Histograms of oriented gradients for human detection. In Proceedings of the 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’05), San Diego, CA, USA, 20–25 June 2005; Volume 1, pp. 886–893. [Google Scholar] [CrossRef]
  75. Ojala, T.; Pietikäinen, M.; Harwood, D. A comparative study of texture measures with classification based on featured distributions. Pattern Recognit. 1996, 29, 51–59. [Google Scholar] [CrossRef]
  76. Ahonen, T.; Hadid, A.; Pietikainen, M. Face Description with Local Binary Patterns: Application to Face Recognition. IEEE Trans. Pattern Anal. Mach. Intell. 2006, 28, 2037–2041. [Google Scholar] [CrossRef]
  77. Kuhn, H.W. The hungarian method for the assignment problem. Nav. Res. Logist. Q. 1955, 2, 83–97. [Google Scholar] [CrossRef]
  78. Bochinski, E.; Eiselein, V.; Sikora, T. High-Speed tracking-by-detection without using image information. In Proceedings of the 2017 14th IEEE International Conference on Advanced Video and Signal Based Surveillance (AVSS), Lecce, Italy, 29 August–1 September 2017; pp. 1–6. [Google Scholar] [CrossRef]
  79. Zhou, X.; Wang, D.; Krähenbühl, P. Objects as Points. arXiv 2019, arXiv:1904.07850. [Google Scholar]
  80. Dimitriou, N.; Stavropoulos, G.; Moustakas, K.; Tzovaras, D. Multiple object tracking based on motion segmentation of point trajectories. In Proceedings of the 2016 13th IEEE International Conference on Advanced Video and Signal Based Surveillance (AVSS), Colorado Springs, CO, USA, 23–26 August 2016; pp. 200–206. [Google Scholar] [CrossRef]
  81. Cai, J.; Wang, Y.; Zhang, H.; Hsu, H.; Ma, C.; Hwang, J. IA-MOT: Instance-Aware Multi-Object Tracking with Motion Consistency. CoRR 2020, abs/2006.13458. Available online: http://xxx.lanl.gov/abs/2006.13458 (accessed on 7 September 2022).
  82. Yan, B.; Jiang, Y.; Sun, P.; Wang, D.; Yuan, Z.; Luo, P.; Lu, H. Towards Grand Unification of Object Tracking. arXiv 2022, arXiv:2207.07078. [Google Scholar]
  83. Yang, F.; Chang, X.; Dang, C.; Zheng, Z.; Sakti, S.; Nakamura, S.; Wu, Y. ReMOTS: Self-Supervised Refining Multi-Object Tracking and Segmentation. CoRR 2020, abs/2007.03200. Available online: http://xxx.lanl.gov/abs/2007.03200 (accessed on 7 September 2022).
  84. Voigtlaender, P.; Krause, M.; Osep, A.; Luiten, J.; Sekar, B.B.G.; Geiger, A.; Leibe, B. MOTS: Multi-Object Tracking and Segmentation. arXiv 2019, arXiv:1902.03604. [Google Scholar]
  85. Simon, M.; Amende, K.; Kraus, A.; Honer, J.; Sämann, T.; Kaulbersch, H.; Milz, S.; Gross, H.M. Complexer-YOLO: Real-Time 3D Object Detection and Tracking on Semantic Point Clouds. arXiv 2019, arXiv:1904.07537. [Google Scholar]
  86. Zhang, W.; Zhou, H.; Sun, S.; Wang, Z.; Shi, J.; Loy, C.C. Robust Multi-Modality Multi-Object Tracking. arXiv 2019, arXiv:1909.03850. [Google Scholar]
  87. Frossard, D.; Urtasun, R. End-to-end Learning of Multi-sensor 3D Tracking by Detection. In Proceedings of the 2018 IEEE International Conference on Robotics and Automation (ICRA), Brisbane, QLD, Australia, 21–25 May 2018; pp. 635–642. [Google Scholar] [CrossRef]
  88. Weng, X.; Wang, Y.; Man, Y.; Kitani, K. GNN3DMOT: Graph Neural Network for 3D Multi-Object Tracking with Multi-Feature Learning. arXiv 2020, arXiv:2006.07327. [Google Scholar]
  89. Sualeh, M.; Kim, G.W. Visual-LiDAR Based 3D Object Detection and Tracking for Embedded Systems. IEEE Access 2020, 8, 156285–156298. [Google Scholar] [CrossRef]
  90. Shenoi, A.; Patel, M.; Gwak, J.; Goebel, P.; Sadeghian, A.; Rezatofighi, H.; Martín-Martín, R.; Savarese, S. JRMOT: A Real-Time 3D Multi-Object Tracker and a New Large-Scale Dataset. arXiv 2020, arXiv:2002.08397. [Google Scholar]
  91. Paszke, A.; Chaurasia, A.; Kim, S.; Culurciello, E. ENet: A Deep Neural Network Architecture for Real-Time Semantic Segmentation. CoRR 2016, abs/1606.02147. Available online: http://xxx.lanl.gov/abs/1606.02147 (accessed on 7 September 2022).
  92. Milan, A.; Leal-Taixé, L.; Reid, I.; Roth, S.; Schindler, K. MOT16: A Benchmark for Multi-Object Tracking. arXiv 2016, arXiv:1603.00831. [Google Scholar]
  93. Dendorfer, P.; Rezatofighi, H.; Milan, A.; Shi, J.; Cremers, D.; Reid, I.D.; Roth, S.; Schindler, K.; Leal-Taixé, L. MOT20: A benchmark for multi object tracking in crowded scenes. CoRR 2020, abs/2003.09003. Available online: http://xxx.lanl.gov/abs/2003.09003 (accessed on 7 September 2022).
  94. Geiger, A.; Lenz, P.; Urtasun, R. Are we ready for Autonomous Driving? The KITTI Vision Benchmark Suite. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Providence, RI, USA, 16–21 June 2012; pp. 3354–3361. [Google Scholar]
  95. Behley, J.; Garbade, M.; Milioto, A.; Quenzel, J.; Behnke, S.; Stachniss, C.; Gall, J. SemanticKITTI: A Dataset for Semantic Scene Understanding of LiDAR Sequences. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Korea, 27 October–2 November 2019. [Google Scholar]
  96. Wen, L.; Du, D.; Cai, Z.; Lei, Z.; Chang, M.; Qi, H.; Lim, J.; Yang, M.; Lyu, S. UA-DETRAC: A New Benchmark and Protocol for Multi-Object Detection and Tracking. Comput. Vis. Image Underst. 2020, 193, 102907. [Google Scholar] [CrossRef]
  97. Lyu, S.; Chang, M.C.; Du, D.; Li, W.; Wei, Y.; Del Coco, M.; Carcagnì, P.; Schumann, A.; Munjal, B.; Choi, D.H.; et al. UA-DETRAC 2018: Report of AVSS2018 & IWT4S challenge on advanced traffic monitoring. In Proceedings of the 2018 15th IEEE International Conference on Advanced Video and Signal Based Surveillance (AVSS), Auckland, New Zealand, 27–30 November 2018; pp. 1–6. [Google Scholar]
  98. Lyu, S.; Chang, M.C.; Du, D.; Wen, L.; Qi, H.; Li, Y.; Wei, Y.; Ke, L.; Hu, T.; Del Coco, M.; et al. UA-DETRAC 2017: Report of AVSS2017 & IWT4S Challenge on Advanced Traffic Monitoring. In Proceedings of the 2017 14th IEEE International Conference on Advanced Video and Signal Based Surveillance (AVSS), Lecce, Italy, 29 August–1 September 2017; pp. 1–7. [Google Scholar]
  99. Luiten, J.; Osep, A.; Dendorfer, P.; Torr, P.H.S.; Geiger, A.; Leal-Taixé, L.; Leibe, B. HOTA: A Higher Order Metric for Evaluating Multi-Object Tracking. CoRR 2020, abs/2009.07736. Available online: http://xxx.lanl.gov/abs/2009.07736 (accessed on 7 September 2022).
100. Henschel, R.; Leal-Taixé, L.; Cremers, D.; Rosenhahn, B. Fusion of Head and Full-Body Detectors for Multi-Object Tracking. arXiv 2017, arXiv:1705.08314. [Google Scholar]
  101. Henschel, R.; Zou, Y.; Rosenhahn, B. Multiple People Tracking Using Body and Joint Detections. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Long Beach, CA, USA, 16–17 June 2019; pp. 770–779. [Google Scholar] [CrossRef]
  102. Weng, X.; Kitani, K. A Baseline for 3D Multi-Object Tracking. CoRR 2019, abs/1907.03961. Available online: http://xxx.lanl.gov/abs/1907.03961 (accessed on 7 September 2022).
  103. Gloudemans, D.; Work, D.B. Localization-Based Tracking. CoRR 2021, abs/2104.05823. Available online: http://xxx.lanl.gov/abs/2104.05823 (accessed on 7 September 2022).
  104. Sun, S.; Akhtar, N.; Song, X.; Song, H.; Mian, A.; Shah, M. Simultaneous Detection and Tracking with Motion Modelling for Multiple Object Tracking. CoRR 2020, abs/2008.08826. Available online: http://xxx.lanl.gov/abs/2008.08826 (accessed on 7 September 2022).
  105. Luiten, J.; Fischer, T.; Leibe, B. Track to Reconstruct and Reconstruct to Track. IEEE Robot. Autom. Lett. 2020, 5, 1803–1810. [Google Scholar] [CrossRef]
  106. Wang, S.; Sun, Y.; Liu, C.; Liu, M. PointTrackNet: An End-to-End Network For 3-D Object Detection and Tracking From Point Clouds. arXiv 2020, arXiv:2002.11559. [Google Scholar] [CrossRef]
  107. Wang, L.; Zhang, X.; Qin, W.; Li, X.; Yang, L.; Li, Z.; Zhu, L.; Wang, H.; Li, J.; Liu, H. CAMO-MOT: Combined Appearance-Motion Optimization for 3D Multi-Object Tracking with Camera-LiDAR Fusion. arXiv 2022, arXiv:2209.02540. [Google Scholar]
  108. Sun, Y.; Zhao, Y.; Wang, S. Multiple Traffic Target Tracking with Spatial-Temporal Affinity Network. Comput. Intell. Neurosci. 2022, 2022, 9693767. [Google Scholar] [CrossRef]
  109. Messoussi, O.; de Magalhaes, F.G.; Lamarre, F.; Perreault, F.; Sogoba, I.; Bilodeau, G.; Nicolescu, G. Vehicle Detection and Tracking from Surveillance Cameras in Urban Scenes. CoRR 2021, abs/2109.12414. Available online: http://xxx.lanl.gov/abs/2109.12414 (accessed on 7 September 2022).
  110. Wang, G.; Gu, R.; Liu, Z.; Hu, W.; Song, M.; Hwang, J. Track without Appearance: Learn Box and Tracklet Embedding with Local and Global Motion Patterns for Vehicle Tracking. CoRR 2021, abs/2108.06029. Available online: http://xxx.lanl.gov/abs/2108.06029 (accessed on 7 September 2022).
Figure 1. General framework of MOT systems. Visual and motion features of the objects detected at frame T are extracted and compared with those detected in previous frames. A robust data association algorithm matches the features of the same objects across frames. The final output of the system is a set of tracks with unique IDs identifying the multiple objects detected and tracked over the frames.
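To make the Figure 1 pipeline concrete, the following is a minimal sketch of a tracking-by-detection loop, assuming axis-aligned bounding boxes and an IoU-based cost solved with the Hungarian algorithm; the class and parameter names (SimpleTracker, iou_threshold) are illustrative and do not correspond to any specific method surveyed here.

```python
# Minimal tracking-by-detection sketch of the Figure 1 pipeline (illustrative only).
# Detections are axis-aligned boxes [x1, y1, x2, y2]; association uses IoU + Hungarian.
import numpy as np
from scipy.optimize import linear_sum_assignment

def iou(a, b):
    """Intersection-over-Union of two [x1, y1, x2, y2] boxes."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

class SimpleTracker:
    def __init__(self, iou_threshold=0.3):
        self.tracks = {}          # track_id -> last matched box
        self.next_id = 0
        self.iou_threshold = iou_threshold

    def update(self, detections):
        """Associate current-frame detections with existing tracks and assign IDs."""
        ids = list(self.tracks.keys())
        if ids and len(detections):
            cost = np.array([[1.0 - iou(self.tracks[i], d) for d in detections]
                             for i in ids])
            rows, cols = linear_sum_assignment(cost)   # Hungarian algorithm
        else:
            rows, cols = np.array([], dtype=int), np.array([], dtype=int)

        assigned, matched_dets = {}, set()
        for r, c in zip(rows, cols):
            if 1.0 - cost[r, c] >= self.iou_threshold:   # accept only sufficiently overlapping pairs
                self.tracks[ids[r]] = detections[c]
                assigned[ids[r]] = detections[c]
                matched_dets.add(c)
        # Unmatched detections start new tracks with fresh IDs.
        for c, det in enumerate(detections):
            if c not in matched_dets:
                self.tracks[self.next_id] = det
                assigned[self.next_id] = det
                self.next_id += 1
        return assigned

tracker = SimpleTracker()
print(tracker.update([[0, 0, 10, 10]]))    # frame T: object gets ID 0
print(tracker.update([[1, 1, 11, 11]]))    # frame T+1: the same object keeps ID 0
```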
Figure 2. Preserving the ID during full occlusion. The frames are obtained from the MOT15 dataset [19]. The yellow arrow points to a track (top image) that undergoes full occlusion (middle image). The objective of the MOT system is to preserve the ID of the track (bottom image) and match the previously detected object with the reappeared one.
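The ID-preservation behavior illustrated in Figure 2 is typically achieved by keeping lost tracks in memory for a limited number of frames. The sketch below is a minimal illustration assuming an appearance-embedding distance; the buffer design, the max_age value, and the 0.5 threshold are assumptions rather than the mechanism of any particular tracker.

```python
# Illustrative sketch of ID preservation through a full occlusion (cf. Figure 2):
# a track that disappears is kept for `max_age` frames and can be re-claimed by a
# re-appearing detection, so its ID survives the occlusion.
import numpy as np

def appearance_distance(f1, f2):
    """Cosine distance between two appearance embeddings."""
    f1, f2 = np.asarray(f1, float), np.asarray(f2, float)
    return 1.0 - float(f1 @ f2 / (np.linalg.norm(f1) * np.linalg.norm(f2) + 1e-9))

class LostTrackBuffer:
    def __init__(self, max_age=30, dist_threshold=0.5):
        self.lost = {}                  # track_id -> (embedding, age in frames)
        self.max_age = max_age
        self.dist_threshold = dist_threshold

    def add(self, track_id, embedding):
        """Store a track that was not matched in the current frame."""
        self.lost[track_id] = (embedding, 0)

    def step(self):
        """Age the buffer once per frame; drop tracks occluded for too long."""
        self.lost = {i: (e, a + 1) for i, (e, a) in self.lost.items()
                     if a + 1 <= self.max_age}

    def reclaim(self, embedding):
        """Return the ID of the best-matching lost track, or None if nothing matches."""
        best_id, best_d = None, self.dist_threshold
        for i, (e, _) in self.lost.items():
            d = appearance_distance(embedding, e)
            if d < best_d:
                best_id, best_d = i, d
        if best_id is not None:
            del self.lost[best_id]
        return best_id
```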
Figure 3. The point tracking approach in Ref. [46]. The current and previous frames are passed into the CenterTrack network, which uses motion features to detect and match tracks.
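The displacement-based association behind point-tracking methods such as CenterTrack [46] can be summarized as follows. The sketch assumes the network outputs (object centers for frame T plus a 2D offset pointing back to frame T−1) are already available; the matching radius is an arbitrary illustrative value.

```python
# Sketch of displacement-based greedy matching used by point-tracking approaches:
# each current center plus its predicted backward offset is compared against the
# centers tracked in the previous frame.
import numpy as np

def match_by_displacement(prev_centers, prev_ids, curr_centers, offsets, radius=30.0):
    """Greedy association of current detections to previous tracks."""
    assignments, used = {}, set()
    for j, (c, off) in enumerate(zip(curr_centers, offsets)):
        predicted_prev = np.asarray(c) + np.asarray(off)   # estimated position at T-1
        dists = [np.linalg.norm(predicted_prev - np.asarray(p)) if i not in used else np.inf
                 for i, p in enumerate(prev_centers)]
        if dists and np.min(dists) < radius:
            k = int(np.argmin(dists))
            assignments[j] = prev_ids[k]     # inherit the previous track ID
            used.add(k)
        else:
            assignments[j] = None            # unmatched -> would spawn a new track
    return assignments

# Example: the object moved +5 px in x between frames, so its offset points back by -5.
print(match_by_displacement([[100, 100]], [7], [[105, 100]], [[-5, 0]]))  # {0: 7}
```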
Figure 4. Overview of the track tree method. The strength of each branch depends on a score computed by the matching algorithm. The green lines indicate matched tracks; the red circle indicates a lost track.
Figure 5. The hierarchical single-branch network proposed in Ref. [52]. The frames from a video source are passed into the network, which outputs detections and tracks.
Figure 6. P3AFormer system overview [57]. The current and previous frames are passed in for feature extraction and detection. The extraction module consists of a backbone network, a pixel-level decoder, and a detector, as illustrated on the right. The features are passed to MLP heads to output the class, center, and size features.
Figure 7. Motion segmentation. The trajectory points are projected across multiple frames.
Figure 8. Complexer-YOLO [85]. RGB frames and point cloud data are mapped and passed into Complexer-YOLO for tracking and matching.
Figure 9. An end-to-end approach for 3D detection and tracking [87]. The RGB and point cloud data are passed into a detection network. Matching and scoring nets are then trained to generate trajectories across multiple frames.
Figure 10. Samples from the MOT15–17 ground-truth datasets: MOT15 (top image), MOT16 (middle image), and MOT17 (bottom image).
Figure 11. Samples of the KITTI dataset, including point cloud and RGB data. Visual odometry trajectory (top left), disparity and optical flow map (top right), visualized point cloud data [95] (middle), and 3D labels (bottom).
Figure 12. Samples of the UA-DETRAC dataset showing the variation in illumination in the environment as captured by a static camera.
Table 1. Recent reviews and the data used for evaluation and comparison.

Review | Year | Evaluation Dataset
Ciaparrone et al. [13] | 2019 | MOT 15, 16, 17
Xu et al. [14] | 2019 | MOT 15, 16
Luo et al. [12] | 2022 | PETS2009-S2L1
Ours | 2022 | KITTI, MOT 15, 16, 17, 20, and UA-DETRAC
Table 2. Summary of the components used in MOT techniques.

Tracker | Appearance Cue | Motion Cue | Data Association | Mode | Occlusion Handling
Keuper et al. [37] | - | Optical flow | Correlation Co-Clustering | Online | Point trajectories
Fang et al. [38] | fc8 layer of the Inception network | Relative center coordinates | Conditional Probability | Online | Track history
Xiang et al. [32] | VGG-16 | LSTM network | Metric Learning | Online | Track history
Chu et al. [39] | CNN | - | Reinforcement Learning | Online | SVM Classifier
Zhu et al. [40] | ECO | - | Dual Matching Attention Networks | Online | Spatial attention network
Zhou et al. [41] | ResNet50 | Linear Model | Siamese Networks | Online | Track history
Sun et al. [42] | VGG-like network | - | Affinity estimator | - | Track history
Peng et al. [43] | ResNet-50 + FPN | - | Prediction Network | Online | Track history
Wang et al. [44] | FaceNet | - | Multi-Scale TrackletNet | - | Track history
Mahmoudi et al. [34] | CNN | Relative mean velocity and position | Hungarian Algorithm | Online | Track history
Zhou et al. [35] | ResNet-50 | Kalman Filter | Hungarian Algorithm | Online | Track history
Lan et al. [45] | Decoder | Relative position | Unary Potential | Near-online | Track history
Zhou et al. [46] | - | CenterTrack | 2D displacement prediction | Online | Track history
Karunasekera et al. [33] | Matching histograms | Relative position | Hungarian Algorithm | Online | Grid Structure + Track history
Chu et al. [47] | Siamese-style network | - | R1TA Power Iteration layer | Online | Track history
Zhao et al. [36] | CNN + PCA + Correlation Filter | - | Hungarian Algorithm | Online | APCE + IoU
Chen et al. [48] | Faster-RCNN (VGG16) | - | Category Classifier | Online | Track history
Sadeghian et al. [31] | VGG16 + LSTM network | LSTM network | RNN | Online | Track history
Yoon et al. [49] | - | - | Encoder-Decoder | Online | Track history
Xu et al. [50] | FairMOT | Kalman Filter | Hungarian Algorithm | Online | Track history
Ye et al. [51] | LDAE | Kalman Filter | Cosine Distance | Online | Track history
Wang et al. [52] | Faster-RCNN (ResNet50) | Kalman Filter | Faster RCNN (ResNet50) | Online | Track history
Yu et al. [53] | DLA-34 + GCD + GTE | - | Hungarian Algorithm | Online | Track history
Wang et al. [54] | PCB | Kalman Filter | RTU++ | Online | Track history
Gao et al. [55] | ResNet34 | - | Depth-wise Cross Correlation | Online | Track history
Nasseri et al. [56] | - | Kalman Filter | Normalized IoU | Online | Track history
Zhao et al. [57] | Transformer + Pixel Decoder | Kalman Filter | Cosine Distance | Online | Track history
Aharon et al. [58] | FastReID's SBS-50 | Kalman Filter | Cosine Distance | Online | Track history
Seidenschwarz et al. [59] | ResNet50 | Linear Model | Histogram Distance | Online | Track history
Dai et al. [60] | CNN | - | Hungarian Algorithm | Online | Track history
Zhang et al. [61] | FairMOT | Kalman Filter | IoU + Hungarian Algorithm | Online | Track history
Hyun et al. [62] | FairMOT | - | Sparse Graph + Hungarian | Online | Track history
Chen et al. [63] | DETR Network | - | Hungarian Algorithm | Online | Track history
Cao et al. [64] | - | Kalman Filter + non-linear model + Smoothing | Observation Centric Recovery | Online | Track history
Wan et al. [65] | YOLOx | - | Self and Cross Attention Layers | Online | Track history
Du et al. [66] | - | NST Kalman Filter + ResNet50 | AFLink | Online | Track history
Zhang et al. [67] | ResNet50 | Kalman Filter | Hungarian Algorithm | Online | Track history
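Most rows of Table 2 follow the same recipe: a motion cue (often a constant-velocity Kalman filter), an appearance embedding, and the Hungarian algorithm [77] for data association. The sketch below combines these three ingredients under an assumed state layout [cx, cy, vx, vy] and an assumed weighting factor lam; it is a generic illustration, not the configuration of any tracker listed in the table.

```python
# Generic motion + appearance association sketch (illustrative assumptions throughout).
import numpy as np
from scipy.optimize import linear_sum_assignment

def kalman_predict(x, P, dt=1.0, q=1.0):
    """Constant-velocity prediction for state [cx, cy, vx, vy]; feeds track_states below."""
    F = np.array([[1, 0, dt, 0],
                  [0, 1, 0, dt],
                  [0, 0, 1, 0],
                  [0, 0, 0, 1]], float)
    return F @ x, F @ P @ F.T + q * np.eye(4)

def cosine_distance(a, b):
    """Appearance cost between two embedding vectors."""
    a, b = np.asarray(a, float), np.asarray(b, float)
    return 1.0 - float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

def associate(track_states, track_feats, det_centers, det_feats, lam=0.5, gate=50.0):
    """Combine motion and appearance costs, then solve the assignment problem."""
    cost = np.zeros((len(track_states), len(det_centers)))
    for i, (x, f_t) in enumerate(zip(track_states, track_feats)):
        for j, (c, f_d) in enumerate(zip(det_centers, det_feats)):
            motion = np.linalg.norm(np.asarray(x)[:2] - np.asarray(c)) / gate  # normalized distance
            cost[i, j] = lam * motion + (1 - lam) * cosine_distance(f_t, f_d)
    rows, cols = linear_sum_assignment(cost)     # Hungarian algorithm [77]
    return list(zip(rows.tolist(), cols.tolist())), cost
```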
Table 3. Summary of the sensor fusion approaches used in 3D MOT techniques.

Tracker | RGB Camera | Point Cloud LIDAR | Data Fusion | Target Tracking
Simon et al. [85] | ENet | 3D Voxel Quantization | Complex-YOLO | LMB RFS
Zhang et al. [86] | VGG-16 | PointNet | Point-wise convolution + Start and end estimator | Linear Programming
Frossard et al. [87] | MV3D | MV3D | MV3D | Linear Programming
Hu et al. [71] | Faster R-CNN | 34-layer DLA-up | Monocular 3D estimation | LSTM Network
Weng et al. [88] | ResNet | PointNet | Addition | Graph Neural Network
Sualeh et al. [89] | YOLOv3 | IMM-UKF-JPDAF | 3D projection | Munkres Association
Shenoi et al. [90] | Mask RCNN | F-PointNet | 3-layered fully connected network | JPDA
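A common building block behind the fusion strategies in Table 3 is projecting LiDAR points into the image plane so that 2D appearance features and 3D geometry can be combined per detection. The following is a minimal sketch assuming a 3×4 camera projection matrix and a simple late-fusion concatenation; the function names and the chosen geometric statistics are illustrative, not the fusion scheme of any specific method above.

```python
# Minimal camera-LiDAR fusion sketch (illustrative assumptions throughout).
import numpy as np

def project_lidar_to_image(points_xyz, P):
    """Project Nx3 LiDAR points with a 3x4 camera projection matrix; returns Nx2 pixels."""
    pts_h = np.hstack([points_xyz, np.ones((points_xyz.shape[0], 1))])   # homogeneous coords
    uvw = pts_h @ P.T
    return uvw[:, :2] / uvw[:, 2:3]                                       # perspective divide

def points_in_box(pixels, box):
    """Boolean mask of projected points that fall inside a [x1, y1, x2, y2] box."""
    x1, y1, x2, y2 = box
    return (pixels[:, 0] >= x1) & (pixels[:, 0] <= x2) & \
           (pixels[:, 1] >= y1) & (pixels[:, 1] <= y2)

def fuse_features(image_feat, points_xyz, pixels, box):
    """Late fusion: concatenate the 2D appearance feature with simple 3D statistics."""
    mask = points_in_box(pixels, box)
    if not mask.any():
        geom = np.zeros(4)
    else:
        pts = points_xyz[mask]
        geom = np.array([pts[:, 0].mean(), pts[:, 1].mean(), pts[:, 2].mean(), mask.sum()])
    return np.concatenate([np.asarray(image_feat, float), geom])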
Table 4. Examples of the types of data included in MOT15 and MOT16.

Dataset | Sequence | Length | Tracks | FPS | Platform | Viewpoint | Density | Weather
MOT 2015 | TUD-Crossing | 201 | 13 | 25 | static | horizontal | 5.5 | cloudy
MOT 2015 | PETS09-S2L2 | 436 | 42 | 7 | static | high | 22.1 | cloudy
MOT 2015 | ETH-Jelmoli | 440 | 45 | 14 | moving | low | 5.8 | sunny
MOT 2015 | KITTI-16 | 209 | 17 | 10 | static | horizontal | 8.1 | sunny
MOT 2016 | MOT16-01 | 450 | 23 | 30 | static | horizontal | 14.2 | cloudy
MOT 2016 | MOT16-03 | 1500 | 148 | 30 | static | high | 69.7 | night
MOT 2016 | MOT16-06 | 1194 | 221 | 14 | moving | low | 9.7 | sunny
MOT 2016 | MOT16-12 | 900 | 86 | 30 | moving | horizontal | 9.2 | indoor
Table 5. The performance using the MOT15, 16, 17, and 20 datasets.
Dataset | Tracker | MOTA ↑ | MOTP ↑ | IDF1 ↑ | MT ↑ | ML ↓ | FP ↓ | FN ↓ | IDS ↓ | Frag ↓ | Hz ↑ | HOTA ↑ | Protocol
MOT15Chen et al. [48]53.075.5 29.120.2515922,9847081476 Private
Chen et al. [48]38.572.6 8.737.4400533,2045861263 Public
Sadeghian et al. [31]37.671.7 15.826.8793329,397102620241 Public
Keuper et al. [37]35.6 45.123.239.310,58028,508457969 Private
Fang et al. [38]56.573.061.345.114.6938616,92142813645.1 Public
Xiang et al. [32]37.172.528.412.639.7830529,73258011931.0 Private
Chu et al. [39]38.970.644.516.631.5732129,50172014400.3 Public
Sun et al. [42]38.371.145.617.641.212902700164815156.3 Public
Zhou et al. [46]30.571.21.17.641.2653435,28487922085.9 Private
Dimitriou et al. [80]24.270.2 7.149.2640039,6595291034 Private
MOT16Yoon et al. [49]22.570.925.96.461.9734639,09211591538172.8 Public
Sadeghian et al. [31]47.2 75.814.041.6268192,85677416751.0 Public
Keuper et al. [37]47.1 52.320.446.9670389,368370598 Private
Fang et al. [38]6378.8 39.922.113,66353,2484821251 Public
Xiang et al. [32]48.376.7 15.440.1270691,047543 0.5 Private
Zhu et al. [40]46.173.854.817.442.7790989,87453216160.3 Private
Zhou et al. [41]64.878.673.540.622.013,47049,927794105018.2 Private
Chu et al. [39]48.875.747.215.838.1587586,56790611160.1 Public
Peng et al. [43]67.678.457.232.923.1893448,3051897 34.4 Private
Henschel et al. [100]47.8 47.819.138.2888685,487852 Private
Wang et al. [44]49.2 56.117.340.3840083,702 882 Public
Mahmoudi et al. [34]65.278.4 32.421.3657855,896946228311.2 Public
Zhou et al. [35]40.874.4 13.738.315,14391,792105122106.5 Private
Lan et al. [45]45.474.4 18.138.713,40785,547600930 Public
Xu et al. [50]74.4 73.746.115.2 60.2Private
Ye et al. [51]62.879.463.134.417.212,46354,648701983 Private
Wang et al. [52]50.4 47.5 18,37069,8001826 Private
Yu et al. [53]75.680.975.843.121.5978634,214448 61.7Private
Dai et al. [60]63.8 70.630.330.6741257,975629 11.254.7Private
Hyun et al. [62]76.7 73.149.110.710,68930,4281420 Private
MOT17Keuper et al. [37]51.2 20.737.424,986248,32818512991 Private
Zhu et al. [40]48.275.955.719.338.326,218263,6082194537811.4 Private
Sun et al. [42]52.453.976.949.521.425,423234,592843114,7976.3 Public
Peng et al. [43]66.678.257.432.224.222,284160,4915529 34.4 Private
Henschel et al. [100]51.3 47.621.435.224,101247,9212648 Private
Wang et al. [44]51.9 58.023.535.537,311231,658 2917 Public
Henschel et al. [101]52.6 50.819.735.831,572232,57230503792 Private
Zhou et al. [46]61.5 59.626.431.914,076200,6722583 Private
Karunasekera et al. [33]46.976.1 16.936.3 4478 17.1 Private
Xu et al. [50]73.173.04516.8 59.7Private
Ye et al. [51]61.779.362.032.520.132,863181,03518093168 Private
Yu et al. [53]73.881.074.741.723.227,999118,6231374 61.0Private
Wang et al. [54]79.5 79.1 29,50884,6181302204624.763.9Private
Gao et al. [55]70.2 65.543.115.430,367115,9863265 10.7 Private
Nasseri et al. [56]80.4 77.7 28,88779,32923254689 63.3Private
Zhao et al. [57]81.2 78.154.513.217,28186,8611893 Private
Aharon et al. [58]80.5 80.2 22,52186,0371212 4.565.0Private
Seidenschwarz et al. [59]61.7 64.226.332.2 1639 50.9Private
Dai et al. [60]63.0 69.930.430.623,022183,6591842 10.854.6Private
Zhang et al. [61]62.1 65.0 24,052188,2641768 Public
Zhang et al. [61]77.3 75.9 45,03079,7163255 Private
Hyun et al. [62]76.5 73.647.612.729,80899,5103369 Private
Chen et al. [63]76.5 72.6 59.7Private
Cao et al. [64]78.0 77.5 15,100108,00019502040 63.2Private
Wan et al. [65]67.2 66.538.320.517,875164,0322896 Public
Wan et al. [65]75.6 76.448.816.226,983108,1862394 Private
Du et al. [66]79.6 79.5 1194 7.164.4Private
Zhang et al. [67]80.3 77.3 25,49183,7212196 29.663.1Private
MOT20Xu et al. [50]59.7 67.866.57.7 54.0Private
Ye et al. [51]54.979.159.136.223.519,953211,39216302455 Private
Wang et al. [52]51.5 44.5 38,223208,6164055 Private
Yu et al. [53]67.279.270.562.28.961,134104,5974243 56.5Private
Wang et al. [54]76.5 76.8 19,247101,290971119015.762.8Private
Nasseri et al. [56]76.8 76.4 27,10691,74014463053 61.4Private
Zhao et al. [57]78.1 76.470.57.425,41386,5101332 Private
Aharon et al. [58]77.8 77.5 24,63888,8631257 2.463.3Private
Seidenschwarz et al. [59]52.7 55.029.126.8 1437 43.4Private
Dai et al. [60]60.1 66.746.517.837,657165,8662926 4.051.7Private
Zhang et al. [61]55.6 65.0 12,297216,986480 Public
Hyun et al. [61]66.3 67.7 41,538130,0722715 Private
Hyun et al. [62]72.8 70.564.312.825,165112,8972649 Private
Chen et al. [63]67.1 59.1 50.4Private
Cao et al. [64]75.5 75.9 18,000108,0009131198 63.2Private
Wan et al. [65]70.4 71.965.19.648,343101,0343739 Private
Du et al. [66]73.8 77.0 25,49183,7212196 1.462.6Private
Zhang et al. [67]77.8 75.2 26,24987,5941223 17.561.3Private
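For reference, the headline metrics reported in Tables 5 and 6 follow the standard CLEAR-MOT and identity-based definitions used by the MOTChallenge benchmarks [92,99]. The helper functions below spell these formulas out with an example on made-up counts; they are illustrative and are not the official evaluation code.

```python
# Reference formulas behind the MOTA/MOTP/IDF1 column headers (illustrative helper only).
def mota(fn, fp, idsw, num_gt):
    """Multiple Object Tracking Accuracy: 1 - (FN + FP + IDSW) / GT boxes."""
    return 1.0 - (fn + fp + idsw) / float(num_gt)

def motp(total_overlap, num_matches):
    """Multiple Object Tracking Precision: average alignment of matched boxes."""
    return total_overlap / float(num_matches)

def idf1(idtp, idfp, idfn):
    """ID F1-score: harmonic mean of identification precision and recall."""
    return 2.0 * idtp / (2.0 * idtp + idfp + idfn)

# Example with made-up counts: 1000 GT boxes, 50 FN, 30 FP, 5 ID switches.
print(round(mota(50, 30, 5, 1000), 3))   # 0.915
```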
Table 6. The performance using the KITTI dataset. We have marked the highest scores in bold for methods evaluated on all categories.
Tracker | MOTA ↑ | MOTP ↑ | MT ↑ | ML ↓ | FP ↓ | FN ↓ | IDS ↓ | Frag ↓ | Type
Chu et al. [47]77.178.851.48.97606998123 Car
Simon et al. [85]75.778.558.05.1 All
Zhang et al. [86]84.885.273.22.777114243284753All
Scheidegger et al. [70]80.481.362.86.2 121613Car
Frossard et al. [87]76.1583.4260.08.31 296868All
Hu et al. [71]84.585.673.42.87054242 All
Weng et al. [88]82.284.164.96 142416Car
Zhao et al. [36]71.381.848.35.9 All
Weng et al. [102]86.278.43 015Car
Weng et al. [102]70.9 Pedestrian
Weng et al. [102]84.9 Cyclist
Luiten et al. [105]84.8 6814260275 Car
Wang et al. [106]68.276.660.612.3 111725All
Sualeh et al. [89]78.1079.370.39.1 21111Car
Sualeh et al. [89]46.167.630.338.7 57500Pedestrian
Sualeh et al. [89]70.977.671.414.3 324Cyclist
Shenoi et al. [90]85.785.5 98 Car
Shenoi et al. [90]46.072.6 395 Pedestrian
Zhou et al. [46]89.485.182.32.3 116744All
Karunasekera et al. [33]85.085.574.32.8 301 All
Zhao et al. [57]91.2 86.52.3 Car
Zhao et al. [57]67.7 49.114.5 Pedestrian
Aharon et al. [58]90.3 250280Car
Aharon et al. [58]65.1 204609Pedestrian
Wang et al. [107]90.485.084.67.382322962 Car
Wang et al. [107]52.264.535.425.411122560 Pedestrian
Sun et al. [108]86.985.783.12.9 271254Car
Table 7. The performance using the UA_DETRAC dataset.
Dataset | Tracker | PR-MOTA ↑ | PR-MOTP ↑ | PR-MT ↑ | PR-ML ↓ | PR-FP ↓ | PR-FN ↓ | PR-IDS ↓ | PR-FRAG ↓
UA_DETRACBochinski et al. [68]30.7373222.618,046179,1913631123
Kutschbach et al. [72]14.5361418.138,597174,0437991607
Hou et al. [69]30.336.330.221.020,263179,3173891260
Sun et al. [42]20.226.314.518.19747135,978
Chu et al. [47]19.836.717.118.214,989164,433617
Gloudemans et al. [103]46.469.541.116.3
Sun et al. [104]12.2 10.814.936,355192,289228674
Messoussi et al. [109]31.250.928.118.56036170,700252
Wang et al. [110]22.535.215.510.1 15633186
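The PR-prefixed metrics of Table 7 are obtained by running the tracker at a sweep of detection score thresholds and integrating the corresponding CLEAR-MOT metric along the detector's precision-recall curve [96]. The sketch below illustrates the idea with a simple trapezoidal line integral; the exact integration and normalization used by the official UA-DETRAC protocol may differ, so treat this purely as a conceptual aid.

```python
# Conceptual sketch of a PR-integrated tracking metric (e.g., PR-MOTA).
import numpy as np

def pr_metric(precisions, recalls, metric_values):
    """Integrate a per-threshold metric along the precision-recall curve."""
    p = np.asarray(precisions, float)
    r = np.asarray(recalls, float)
    m = np.asarray(metric_values, float)
    # Arc length of the PR curve between consecutive detector operating points.
    arc = np.hypot(np.diff(p), np.diff(r))
    # Trapezoidal average of the metric weighted by that arc length.
    return float(np.sum(0.5 * (m[:-1] + m[1:]) * arc))
```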
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
