*Review* **A Review of Tracking and Trajectory Prediction Methods for Autonomous Driving**

**Florin Leon and Marius Gavrilescu \***

Faculty of Automatic Control and Computer Engineering, "Gheorghe Asachi" Technical University of Iași, Bd. Mangeron 27, 700050 Iași, Romania; florin.leon@academic.tuiasi.ro

**\*** Correspondence: marius.gavrilescu@academic.tuiasi.ro

**Abstract:** This paper provides a literature review of some of the most important concepts, techniques, and methodologies used within autonomous car systems. Specifically, we focus on two aspects extensively explored in the related literature: tracking, i.e., identifying pedestrians, cars or obstacles from images, observations or sensor data, and prediction, i.e., anticipating the future trajectories and motion of other vehicles in order to facilitate navigating through various traffic conditions. Approaches based on deep neural networks and others, especially stochastic techniques, are reported.

**Keywords:** autonomous driving; object tracking; trajectory prediction; deep neural networks; stochastic methods

#### **1. Introduction**

Autonomous car technology is already being developed by many companies on different types of vehicles. Complete driverless systems are still at an advanced testing phase, but partially automated systems have been around in the automotive industry for the last few years. Autonomous driving technology has been the focus of multiple research and development efforts by various car manufacturers, universities, and research centers since the mid-1980s.

A famous competition was the DARPA Urban Challenge in 2007. Other examples include the European Land-Robot Trial, which has been held since 2006, the Intelligent Vehicle Future Challenge, between 2009 and 2013, as well as the Autonomous Vehicle Competition, held between 2009 and 2017. Since the early stages of autonomous driving technology development, research in the related fields has been garnering significant interest in universities and industry worldwide.

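In this review, we focus on two aspects of an autonomous car system:

- Tracking: identifying pedestrians, cars, or obstacles from images, observations, or sensor data;
- Prediction: anticipating the future trajectories and motion of other vehicles in order to facilitate navigating through various traffic conditions.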

The paper is composed of two main parts that focus on these topics.

Section 2 deals with tracking problems as addressed in the related literature. We cover aspects concerning the extraction and use of various features for the detection of pedestrians, vehicles, and obstacles across sequences of images and sensor data. Also, we address the various ways in which authors tackle the problems of ensuring detection consistency, temporal coherence, or occlusion handling. We present methods using deep neural networks, but also alternative, conventional approaches.

Section 3 addresses the problem of motion and behavior prediction in traffic scenarios. We discuss various solutions proposed in the related literature for predicting the trajectory of the ego car with respect to the behavior of other traffic participants. We address methods based on deep neural networks and stochastic models, as well as various mixed approaches.

Section 4 contains some conclusions with regard to the aspects discussed throughout the paper.

**Citation:** Leon, F.; Gavrilescu, M. A Review of Tracking and Trajectory Prediction Methods for Autonomous Driving. *Mathematics* **2021**, *9*, 660. https://doi.org/10.3390/math9060660

Academic Editor: Denis N. Sidorov

Received: 28 January 2021 Accepted: 17 March 2021 Published: 19 March 2021

**Publisher's Note:** MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

**Copyright:** © 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

#### **2. Tracking Methods**

Object tracking is an important part of ensuring accurate and efficient autonomous driving. The identification of objects such as pedestrians, cars, and various obstacles from images and vehicle sensor data is a significant and complex interdisciplinary domain. It involves contributions from computer vision, signal processing, and/or machine learning. Object tracking is an essential part of ensuring safe autonomous driving, since it can aid in obstacle avoidance, motion estimation, the prediction of the intentions of pedestrians and other vehicles, as well as path planning. Most sensor data that have to be processed take the form of point clouds, images, or a combination of the two. Point cloud data may be handled in a multitude of ways, the most common of which is some form of 3D grid, where a voxel engine is used to traverse the point space. Some situations call for a reconstruction of the environment from the point cloud which involves various means of resampling and filtering. In some instances, stereo visual information is available and disparities must be computed from the left-right images. Stereo matching is not a trivial task and has the drawback that the computations required for reasonable accuracy usually have a significant impact on performance. In other cases, multiple types of sensor data are available, thereby requiring registration, point matching, and image/point cloud fusion. The problem is further complicated by the necessity to account for temporal cues and to estimate motion from time-based frames.
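As an illustration of the 3D grid representation mentioned above, the following minimal sketch groups points into voxels and replaces each occupied voxel with its centroid, which is one simple form of resampling. The function names `voxelize` and `downsample` and the `voxel_size` parameter are assumptions made for this example and are not taken from any of the reviewed works.

```python
from collections import defaultdict


def voxelize(points, voxel_size):
    """Group 3D points by the integer index of the voxel that contains them."""
    grid = defaultdict(list)
    for x, y, z in points:
        # Floor division maps each coordinate to the index of its voxel.
        key = (int(x // voxel_size), int(y // voxel_size), int(z // voxel_size))
        grid[key].append((x, y, z))
    return grid


def downsample(points, voxel_size):
    """Replace the points of every occupied voxel with their centroid."""
    centroids = []
    for cell in voxelize(points, voxel_size).values():
        n = len(cell)
        centroids.append(tuple(sum(p[i] for p in cell) / n for i in range(3)))
    return centroids
```

A larger `voxel_size` yields a coarser but cheaper representation, so the value would have to be chosen according to sensor density and the real-time budget.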

The scenes involved in autonomous driving scenarios rarely feature a single individual target. Most commonly, multiple objects must be identified and tracked concurrently, some of which may be in motion relative to the vehicle and to each other. As such, most approaches in the related literature handle more than one object and are therefore aimed at solving multiple object tracking problems (MOT).

The tracking problem can be summarized as follows: a sequence of sensor data is available from one or multiple vehicle-mounted acquisition devices. Considering that several observations are identified in all or some of the frames from the sequence, how can the observations from each frame be associated with a set of objects (pedestrians, vehicles, and various obstacles) and how can the trajectories of each such object be reconstructed and predicted as accurately as possible?

Most related methods involve assigning an ID or identifying a response for all objects detected within a frame, and then attempting to match the IDs across subsequent frames. This is often a complex task, considering that the tracked objects may enter and leave the frame at different timestamps. They may also be occluded by the environment or may occlude each other. Additional problems may be caused by defects in the acquired images: noise, sampling or compression artifacts, aliasing, or acquisition errors.
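The ID matching step described above can be sketched in its simplest form as a greedy association based on bounding-box overlap. This is only an illustrative baseline and not the method of any specific reviewed paper; the box format `(x1, y1, x2, y2)`, the `min_iou` threshold, and the function names are assumptions. Detections that match no existing track receive a new ID (objects entering the frame), while tracks without a match are dropped (objects leaving the frame).

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def associate(tracks, detections, next_id, min_iou=0.3):
    """Match previous-frame tracks (dict: id -> box) to new detections.

    Returns the tracks of the current frame and the next unused ID.
    """
    # All (overlap, track id, detection index) triples, best overlap first.
    pairs = sorted(
        ((iou(box, det), tid, j)
         for tid, box in tracks.items()
         for j, det in enumerate(detections)),
        reverse=True,
    )
    current, used_tracks, used_dets = {}, set(), set()
    for score, tid, j in pairs:
        if score < min_iou:
            break  # remaining pairs overlap too little to be the same object
        if tid in used_tracks or j in used_dets:
            continue
        current[tid] = detections[j]
        used_tracks.add(tid)
        used_dets.add(j)
    # Unmatched detections are treated as objects entering the scene.
    for j, det in enumerate(detections):
        if j not in used_dets:
            current[next_id] = det
            next_id += 1
    return current, next_id
```

Such a purely geometric rule breaks down under occlusion or fast motion relative to the frame rate, which is precisely why the methods reviewed below rely on learned appearance and motion features.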

Object tracking for automated driving most commonly has to operate on real-time video. As such, the objective is to correlate tracked objects across multiple video frames, in addition to individual object identification. Accounting for variations in motion comes with an additional set of pitfalls, such as when objects are affected by rotation or scaling transformations, or when the movement speed of the objects is high relative to the frame rate.

In the majority of cases, images are the primary modality for perceiving the scene. As such, much of the effort in the related literature is directed toward 2D MOT. These methods are based on a succession of detection and tracking steps: consecutive detections that are similarly classified are linked together to determine trajectories. A significant challenge comes from the inevitable presence of noise in the acquired images, which may adversely change the features of similar objects across multiple frames. Consequently, the computation of robust features is an important aspect of object detection. Features are representative of a wide array of object properties: color, frequency and distribution, shape, geometry, contours, or correlations within segmented objects. Nowadays, the most popular feature detection methods involve supervised learning. Features start out as groups of random values and are progressively refined using machine learning algorithms. Such approaches require appropriate training data and a careful selection of hyperparameters, often through trial-and-error. However, many results from the related literature show that supervised classification and regression methods offer the best results both in terms of accuracy and robustness to affine transformations, occlusion, and noise.

#### *2.1. Methods Using Neural Networks*

In terms of classifying objects from images, neural networks have seen a steady rise in popularity in recent years, particularly the more elaborate and complex convolutional and recurrent networks from the field of deep learning. Neural networks have the advantage of being able to learn important and robust features given training data that is relevant and in sufficient quantity. Considering that a significant percentage of automotive sensor data consists of images, convolutional neural networks (CNNs) are seeing widespread use in the related literature, for both classification and tracking problems. The advantage of CNNs over more conventional classifiers lies in the convolutional layers, where various filters and feature maps are obtained during training. CNNs are capable of learning object features by means of multiple complex operations and optimizations. The appropriate choice of network parameters and architecture can ensure that these features contain the most useful correlations that are needed for the robust identification of the targeted objects. While this choice is most often an empirical process, a wide assortment of network configurations exist in the related literature that are aimed at solving classification and tracking problems, with high accuracies claimed by the authors. Where object identification is concerned, in some cases the output of the fully-connected component of the CNN is used, whereas in other situations the values of the convolutional layers are exploited in conjunction with other filtering and refining methods.

#### 2.1.1. Learning Features from Convolutional Layers

Many results from the related literature systematically demonstrate that convolutional features are more useful for tracking than other explicitly computed ones (Haar, Fused Histogram of Oriented Gradients (FHOG), color labeling). An example in this sense is [1], which handles MOT using combinations of values from convolutional layers located at multiple levels. The method is based on the notion that lower-level layers account for a larger portion of the input image and therefore contain more details from the identified objects. This makes them useful, for instance, for handling occlusion. Conversely, top-level layers are more representative of semantics and are useful in distinguishing objects from the background. The proposed CNN architecture uses dual fully-connected components, for higher and lower-level features, which handle instance-level and category-level classification (Figure 2 in [1]). The proper identification of objects, particularly where occlusion events occur, involves the generation of appearance models of the tracked objects. These often result from the appropriate processing of the features learned within convolutional layers.

In [2], the authors note that the output of the fully-connected component of a CNN is not suitable for handling infrared images. Their attempt to directly transfer CNNs pretrained with traditional images for use with infrared sensor data is unsuccessful, since only the information from the convolutional layers seems to be useful for this purpose. Furthermore, the layer data itself requires some level of adaptation to the specifics of infrared images. Typically, infrared data offer much less spatial information than visual images. They are much better suited, for example, to depth sensors for gathering distances to objects, albeit at a significantly lower resolution compared to regular image acquisition. As such, convolutional layers from infrared images are used in conjunction with correlation filters to generate a set of weak trackers. This process provides response maps with regard to the targets' locations. The weak trackers are then combined in ensembles which form stronger response maps with a much greater tracking accuracy. The response map of an image is, generally, an intensity image where higher values indicate a change or a desired feature/shape/structure, as the original image is processed by an operator or correlation filter of some kind. By matching or fusing responses from multiple images within a video sequence, one could identify similar objects (e.g., the same pedestrian) across the sequence and subsequently construct their trajectories.
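To make the notion of a response map concrete, the sketch below slides a small filter over a 2D array and records the correlation at each position; the location of the peak is taken as the most likely target position. This is a plain spatial-domain illustration with a hand-specified filter, under the assumption of nested Python lists as images. Practical correlation filter trackers learn the filter and typically operate on convolutional features in the Fourier domain. The names `zero_mean`, `response_map`, and `peak` are introduced here for illustration only.

```python
def zero_mean(kernel):
    """Subtract the mean so that uniformly bright regions do not dominate."""
    values = [v for row in kernel for v in row]
    mean = sum(values) / len(values)
    return [[v - mean for v in row] for row in kernel]


def response_map(image, kernel):
    """Cross-correlate `kernel` with `image` at every valid position."""
    kh, kw = len(kernel), len(kernel[0])
    h, w = len(image), len(image[0])
    return [
        [
            sum(image[r + i][c + j] * kernel[i][j]
                for i in range(kh) for j in range(kw))
            for c in range(w - kw + 1)
        ]
        for r in range(h - kh + 1)
    ]


def peak(resp):
    """Return the (row, column) of the highest response."""
    return max((v, (r, c)) for r, row in enumerate(resp)
               for c, v in enumerate(row))[1]
```

An ensemble in the sense described above could then be approximated by summing several such maps, each obtained from a different layer or filter, before locating the peak.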

The potential of correlation filters is also exploitable for regular images. These have the potential to boost the information extracted from the activations of convolutional layers. In [3] the authors find that by applying appropriate filters to information drawn from shallow convolutional layers, a level of robustness similar to using deeper layers or a combination of multiple layers can be achieved. In [4], the authors also note the added robustness obtainable by post-filtering convolutional layers. By using particle and correlation filters, basic geometric and spatial features can be deduced for the tracked objects, which, together with a means of adaptively generating variable models, can be made to handle both simple and complex scenes.

An alternative approach can be found in [5], where discriminative correlation filters are used to generate an appearance model from a small number of samples. The overall approach involves feature extraction, post-processing, and the generation of response maps for carrying out better model updates within the neural network. Contrary to other similar results, the correlation filters used throughout the system are learned within a one-layer CNN, which eventually can be used to make predictions based on the response maps. Furthermore, residual learning is employed in order to avoid model degradation, instead of the much more frequently-used method of stacking multiple layers. Other tracking methods learn a similar kind of mapping from samples in the vicinity of the target object using deep regression [6,7], or by estimating and learning depth information [8].

The authors of [9] note that correlation filters have limitations imposed by the feature map resolution. They propose a novel solution where features are learned in a continuous domain, using an appropriate interpolation model. This allows for the more effective resolution-independent compositing of multiple feature maps, resulting in superior classification results.

Methods based on discriminative correlation filters are notoriously prone to excessive complexity and overfitting, and various means are available for optimizing the more traditional methods. The most noteworthy in this sense is [10], which employs efficient convolution operators, a training sample distribution scheme, and an optimal update strategy in an attempt to boost performance and reduce the number of parameters. A promising result that demonstrates significant robustness and accuracy is [11], which uses a CNN where the first set of layers is shared, as in a standard CNN. These layers then branch into multiple domain-specific ones. This approach has the benefit of splitting the tracking problem into subproblems which are solved separately in their respective layer sets. Each domain has its own training sequences and can be customized to address a specific issue, such as distinguishing a target with specific shape parameters from the background. A similar concept is exploited by [12], i.e., a network with components distinctly trained for a specific problem. In this case, multiple recurrent layers are used to model different structural properties of the tracked objects, which are incorporated into a parent CNN with the same purpose of improving accuracy and robustness. The Recurrent Neural Network (RNN) layers generate what the authors refer to as "structurally-aware feature maps" which, when combined with pooled versions of their non-structurally aware counterparts, significantly improve the classification results.

#### 2.1.2. High-Level Features, Occlusion Handling, and Feature Fusion

Appearance models offer high-level features that are also used to account for occlusion in much simpler and more efficient systems. In [13], appearance descriptors are compounded to form an appearance space. With properly-determined metrics, observations having a similar appearance are identified using a nearest-neighbor approach. Switching from image space to an appearance space seems to effectively handle occlusions, reducing their negative impact at a negligible performance cost.
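A nearest-neighbor query in an appearance space can be sketched as follows. The example assumes that each track keeps a small gallery of descriptor vectors and that cosine distance is used as the metric; both choices, as well as the `max_distance` threshold and the function names, are assumptions made for illustration and may differ from the exact formulation in [13].

```python
import math


def cosine_distance(a, b):
    """1 - cosine similarity; 0 means identical direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm if norm > 0 else 1.0


def nearest_track(gallery, descriptor, max_distance=0.2):
    """Return the ID of the track whose stored descriptors are closest.

    gallery: dict mapping track ID -> list of past appearance descriptors.
    Returns None when no track is closer than `max_distance`.
    """
    best_id, best_dist = None, max_distance
    for track_id, stored in gallery.items():
        dist = min(cosine_distance(descriptor, s) for s in stored)
        if dist < best_dist:
            best_id, best_dist = track_id, dist
    return best_id
```

Because the query compares against descriptors stored before an occlusion, a target that reappears after several frames can still be matched to its original ID, provided its appearance has not changed too much.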

A possible alternative to appearance-based classification is the use of template-based metrics. Such an approach uses a reference region of interest (ROI) drawn from one or multiple frames and attempts to match it in subsequent frames using an appropriately constructed metric. Template-based methods perform well for partial detections, thereby accounting for occlusion and/or noise. This is because the template need not be perfectly or completely matched for a successful detection to occur. An example of a template-based method is provided by [14], which involves three CNNs, one for template generation, one dedicated to region searching and one for handling background areas. The method is somewhat similar to what could be achieved by a generative adversarial network (GAN). A "searcher" network attempts to fit multiple subimages within the positive detections provided by the template component while simultaneously attempting to maximize the distance to the negative background component. The candidate subimages generated by the three components are fed through a loss function that is designed to favor candidates closer to template regions than to background ones. Performance-wise, such an approach is claimed to provide impressive framerates. However, care should be taken when using template or reference-based methods. These are generally suited for situations where there is no significant variation in the overall tone of the frames. Such methods have a much higher failure rate when, for instance, the lighting conditions change during tracking. An example of this phenomenon is when the tracked object moves from a brightly-lit area to a shaded one.

An improvement on the use of appearance and shared tracking information is provided by [15] in the form of a CNN-based single object tracker that generates and adapts the appearance models for multi-frame detection (Figure 3 in [15]). The use of pooling layers and shared features accounts for drift effects caused by occlusion and inter-object dependency. A spatial and temporal attention mechanism is responsible for dynamically discriminating between training candidates based on the level of occlusion. Training samples are weighted based on their occlusion status, which optimizes the training process where both classification accuracy and performance are concerned. Generally speaking, pooling operations have two important effects: on the one hand, the image region covered by each element of the feature map is increased, since a pooled feature map contains information from a larger area of the originating image; on the other hand, the reduced size of a pooled map means fewer computational resources are required to process it, which improves performance. The major downside of pooling is that spatial positioning is further diluted with each additional layer. Multiple related papers exploit the so-called "ROI pooling", which commonly refers to a pooling operation being applied to the bounding box of an identified object. The resulting reduced representation will hopefully be more robust to noise and geometric variations across multiple frames. ROI pooling is successfully used by [16] to improve the performance of their CNN-based classifier. The authors observe that positioning cues are adversely affected by pooling. A potential solution is to reposition the misaligned ROIs via bilinear interpolation. This reinterpretation of pooling is referred to as "ROI align". The gain in performance is significant, while the authors demonstrate that the positioning of the ROIs is stabilized.
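The role of bilinear interpolation in "ROI align" can be illustrated with a short sketch. Instead of rounding the ROI borders to integer feature-map cells, the ROI is divided into equal bins and the feature map is sampled at fractional coordinates. The version below takes a single sample at the center of each bin and assumes that feature values are located at integer coordinates; practical implementations usually average several samples per bin, and the function names are chosen here for illustration.

```python
def bilinear(fmap, y, x):
    """Interpolate a 2D feature map at the fractional position (y, x)."""
    h, w = len(fmap), len(fmap[0])
    # Clamp the sampling position to the valid range of the map.
    y = min(max(y, 0.0), h - 1)
    x = min(max(x, 0.0), w - 1)
    y0, x0 = int(y), int(x)
    y1, x1 = min(y0 + 1, h - 1), min(x0 + 1, w - 1)
    dy, dx = y - y0, x - x0
    top = fmap[y0][x0] * (1 - dx) + fmap[y0][x1] * dx
    bottom = fmap[y1][x0] * (1 - dx) + fmap[y1][x1] * dx
    return top * (1 - dy) + bottom * dy


def roi_align(fmap, roi, out_h, out_w):
    """Resample the ROI (y1, x1, y2, x2) to a fixed out_h x out_w grid.

    The ROI coordinates may be fractional; no rounding is applied.
    """
    y1, x1, y2, x2 = roi
    bin_h = (y2 - y1) / out_h
    bin_w = (x2 - x1) / out_w
    return [
        [bilinear(fmap, y1 + (i + 0.5) * bin_h, x1 + (j + 0.5) * bin_w)
         for j in range(out_w)]
        for i in range(out_h)
    ]
```

Since no coordinate is quantized, a small shift of the bounding box produces a correspondingly small change in the pooled features, which is consistent with the stabilization effect reported above.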

Tracking stabilization is fundamental in automotive applications, where effects such as jittering, camera shaking, and spatial/temporal noise commonly occur. Occlusion handling plays an important role in ensuring ROI stability and accuracy. Some authors handle this topic extensively, such as [17], who propose a deep neural network for tracking occluded body parts, by processing features extracted from a VGG19 network. Some authors use different interpretations of the feature concept, adapted to the specifics of autonomous driving. Reference [18] creates custom feature maps by encoding various properties of the detections in raster images (bounding boxes, positions, velocities, accelerations). These images are sent through a CNN that generates raster features that the authors demonstrate to provide more reliable correlations and more accurate trajectories than using features derived directly from raw data.

The problem of tracking robustness and stability is sometimes solvable using image and object fusion. The related methods are referred to as being "instance-aware". This concept means that a targeted object is matched across the image space and across multiple frames by fusing identified objects with similar characteristics. Reference [19] proposes a fusion-based method that uses single-object tracking to identify multiple candidate instances. Subsequently, it builds target models for potential objects by fusing information from detection and background cues. The models are updated using a CNN, which ensures robustness to noise, scaling, and minor variations of the targets' appearance. As with many other related approaches, an online implementation offloads most of the processing to an external server leaving the embedded device from the vehicle to carry out only minor, frequent tasks. Since quick reactions of the system are crucial for safe vehicle operation, performance and a rapid response of the underlying software are essential, which is why the online approach is popular in this field. Fusion methods are also applied for multimodal inputs, such as in [20], who propose a model based on a convolutional autoencoder to obtain features from a combination of multiple sensor sources, in order to account for improved environment perception.

Also in the context of ensuring robustness and stability, some authors apply fusion techniques to information extracted from convolutional layers. It has been previously mentioned that important correlations can be drawn from deep and shallow layers that can be exploited together for identifying robust features in the data. This principle is used for instance in [21]. In order to ensure robustness and performance, various features extracted from layers in different parts of a CNN are fused to form stronger characteristics that are affected to a lesser degree by noise, spatial variations, and perturbations in the acquired images. The identified relationships between CNN layers are exploited in order to account for lost spatial information that occurs in deeper layers. The method is claimed to have improved accuracy over the state-of-the-art of the time, which is consistent with the idea of ensuring robustness and low failure rates. Deeper features are more consistent and allow for stronger classification, while shallow features compensate for the detrimental effects of filtering and pooling. This allows for deep features to be better integrated into the spatial context of the images. On a similar note, in [22] features from multiple layers that individually constitute weak trackers are combined to form a stronger one, by means of a hedging algorithm. The practice of combining multiple weak methods into a more effective one has significant potential and is based on the principle that each individual weak component contains some piece of meaningful information on the tracked object, while also having useless data mostly found in the form of noise. By appropriately combining the contributions of each weak component, a stronger one can be generated. As such, methods that exploit compound classifiers typically show robustness to variances of illumination, affine transforms, or camera shaking. The downside of such methods is that multiple groups of weak features are needed, which causes penalties in real-time response. Additionally, the fusion algorithm has its own performance-impacting overhead.
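The principle of hedging among weak trackers can be sketched with the textbook multiplicative-weights (Hedge) update: each weak tracker carries a weight that decays exponentially with the loss it has just incurred, and the final estimate is the weighted average of the individual estimates. This is the generic rule and not necessarily the exact variant used in [22]; the learning rate `eta`, the loss definition, and the function names are assumptions made for this example.

```python
import math


def hedge_update(weights, losses, eta=1.0):
    """Down-weight each weak tracker according to its most recent loss."""
    scaled = [w * math.exp(-eta * loss) for w, loss in zip(weights, losses)]
    total = sum(scaled)
    return [s / total for s in scaled]  # renormalize so the weights sum to 1


def fuse_positions(weights, positions):
    """Weighted average of the (x, y) positions proposed by the weak trackers."""
    x = sum(w * p[0] for w, p in zip(weights, positions))
    y = sum(w * p[1] for w, p in zip(weights, positions))
    return (x, y)
```

In a tracking loop, the loss of each weak tracker could for instance be the distance between its proposal and the fused position of the previous frame, so that trackers which repeatedly disagree with the consensus gradually lose influence.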

Alternative approaches exist which mitigate this to some extent. For example, the use of multiple sensors directly supplies the necessary data, as opposed to relying on multiple features computed from the same camera or pair of cameras. An example in this direction is provided in [23], where an image gallery from a multi-camera system is fed into a CNN in an attempt to solve multi-target multi-camera tracking and target re-identification problems. For correct and consistent re-identification, an observation in a specific image is matched against several ones from other cameras using correlations as part of a similarity metric. Such correlations among images from multiple cameras are learned during training and subsequently clustered to provide a unified agreement between them. Eventually, after a training process that exploits a custom triplet loss function, features are obtained to be further used in the identification process. In terms of performance, the method boasts substantial accuracy considering the multi-camera setup. The idea of compositing robust features from a multi-faceted architecture is further exploited in works such as [24]. A triple-net setup is used to generate features that account for appearance, spatial cues, and temporal consistency.
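For readers unfamiliar with triplet losses, the standard margin-based form is sketched below. It penalizes an embedding in which the anchor is not closer to a sample of the same identity (the positive) than to a sample of a different identity (the negative) by at least a margin. Reference [23] uses a custom variant whose details are not reproduced here; the Euclidean distance, the `margin` value, and the function names are assumptions.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def triplet_loss(anchor, positive, negative, margin=0.5):
    """Zero when the positive is closer than the negative by at least `margin`."""
    return max(0.0, euclidean(anchor, positive)
               - euclidean(anchor, negative) + margin)
```

Minimizing this quantity over many triplets drives observations of the same target, possibly from different cameras, toward each other in feature space, which is what makes the subsequent re-identification by distance meaningful.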

#### 2.1.3. Ensuring Temporal Coherence

One of the most significant challenges for autonomous driving is accounting for temporal coherence in tracking. Nearly all automotive scenarios involve video and motion across multiple frames. Consequently, handling image sequence data and accounting for temporal consistency are key factors in ensuring successful predictions, accuracy, and reliability. Essentially, solving temporal tracking is a compound problem. On the one hand, it involves tracking objects in single images considering all the problems induced by noise, geometry and the lack of spatial information. On the other hand, it should ensure that the tracking is consistent across multiple frames. That is, assigning correct IDs to the same objects in a continuous video sequence.

This presents a lot of challenges, for instance when objects become occluded in some frames and are exposed in others. In some cases, the tracked objects suffer affine transformations across frames, of which rotation and shearing are notoriously difficult to handle. Additionally, the objects may change shape due to noise, aliasing and other acquisition-related artifacts that may be present in the images. Video is rarely if ever acquired at "high enough" resolution and is in many cases in some lossy compressed format. As such, the challenge is to identify features that are robust enough to handle proper classification and to ensure temporal consistency considering all pitfalls associated with processing video data. This often involves a "focus and context" approach: key targets are identified based on features learned from current frames and from the context of the tracked object. Processing a key frame in a video sequence provides the focus, while the information from previous frames forms the context.

For this type of problem, one popular approach is to integrate recurrent components into the classifier, which inherently account for the context provided by a set of elements from a sequence. Neural networks with recurrent layers, such as long short-term memory (LSTM) and gated recurrent units (GRU), are commonly employed in the related literature for the processing of temporal data. When training and exploiting recurrent layers to classify sequences, the results from one frame carry over to the computations that take place for subsequent frames. As such, when processing the current frame, resulting detections also account for what was found in previous frames. For automotive applications, one advantage of neural networks is that they can be trained off-site, while the resulting model can be ported to the embedded device in the vehicle where predictions and tracking can occur at usable speeds. While training a recurrent network or multiple collaborating networks can be a lengthy process, forward-propagating new data can happen quite fast, making these algorithms a good choice for real-time tracking.
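The way in which a recurrent layer carries results from one frame over to the next can be illustrated with a single scalar GRU cell. This is a toy sketch: real layers operate on vectors with learned weight matrices, the parameter names in the dictionary `p` are hypothetical, and the update convention used (new state = (1 - z) * old state + z * candidate) is one of the two commonly encountered forms.

```python
import math


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def gru_step(h, x, p):
    """One GRU update of the hidden state h given the current input x."""
    z = sigmoid(p["wz"] * x + p["uz"] * h + p["bz"])   # update gate
    r = sigmoid(p["wr"] * x + p["ur"] * h + p["br"])   # reset gate
    candidate = math.tanh(p["wh"] * x + p["uh"] * (r * h) + p["bh"])
    return (1.0 - z) * h + z * candidate


def run(frames, p, h0=0.0):
    """Process a sequence of per-frame inputs and return every hidden state."""
    states, h = [], h0
    for x in frames:
        h = gru_step(h, x, p)
        states.append(h)
    return states
```

Feeding two sequences that end with the same frame but differ in their earlier frames yields different final states, which is the mechanism by which detections in the current frame also account for what was found previously.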

Another concept that consistently appears in the related literature is "historical matching". The idea is to carry over part of the characteristics of tracked objects across multiple frames, by building an affinity model from shape, appearance, positional, and motion cues. This is achieved in [25] using dual CNNs with multistep training, which handle appearance matching using various filtering operations and linearly composing the resulting features across multiple timestamps. The notion of determining and preserving affinity is also exploited in [26] where data consisting of frame pairs several timestamps apart are fed into dual VGG networks (models based on convolutional neural networks with an architecture designed for image recognition tasks). The resulting features are permuted and incorporated into association matrices that are further used to compute object affinities. This approach has the benefit of partially accounting for occlusion using only a limited number of frames, since the affinity of an object that is partially occluded in one frame may be preserved if it appears fully in the pair frame.

Ensuring the continuity of high-level features such as appearance models is not a trivial task, and multiple solutions exist. For example [27] uses a CNN modified with a discriminative component intended to correct for temporal errors that may accumulate in the appearance of tracked objects across multiple frames. Discriminative network behavior is also exploited in [28] where selectively trained dual networks are used to generate and correlate appearance with a motion stream. Also, decomposing the tracking problem into localization and motion using multiple component networks is a frequently-encountered solution, further exploited in works such as [29,30]. As such, using two networks that work in tandem is a popular approach and seems to provide accurate results throughout the available literature (Figure 2 in [30]).

In this context, siamese convolutional networks have the ability to learn similarities by comparing features from dual-stream convolutional layers. One example is provided by [31], where appearance and motion are handled by a combination of CNNs that work together within a unified framework. The motion component uses spotlight filtering over feature maps that result from subtracting features drawn from dual CNNs. A space-invariant feature map is then generated using pooling and fusion operations. The other component handles appearance by filtering and fusing features from a different arrangement of convolutional layers. Data from ROIs in the acquired images are passed to both components. Motion responses from one component are correlated with appearance responses from the other. Both components produce feature maps that are composed together to form space- and motion-invariant characteristics to be further used for target identification. As such, a common functionality of such models is to feed the similarities learned among different inputs to subsequent network components that carry out the classification/detection task [32,33].

Some authors take this concept further by employing several network components [34], each of which contributes features exhibiting specific and limited correlations. When joined together, the features form a complete appearance model of the tracked objects. Other approaches map network components to flow graphs, the traversal of which enables optimal cost-function and feature learning [35]. It is worth noting that the more complicated the architecture of the classifier, the more elaborate the training process and the poorer the performance. A careful balance should therefore be reached between the complexity of the classifier, the completeness of the resulting features, and the amount of processing and training data needed to produce high-accuracy results. All this should involve a computational cost consistent with the needs of automotive applications. For instance, in [36] the authors propose a lightweight solution where the feature extractor consists of only two convolutional layers, while a careful selection of motion patterns solves the data association problem.

In [37], the idea of object matching from frame pairs is further explored using a three-component setup: a Siamese network configuration handles single object tracking and generates short-term cues in the form of tracklet images, while a modified version of GoogleNet Inception-v4 generates re-identification features from multiple tracklets. The third component is based on the idea that there may be a large overlap in the previously computed features, which are consequently treated as switcher candidates. As a result, a switcher-aware logic handles the situation where IDs of different objects may be interchanged during frame sequences mainly as a result of partial occlusion.

As the difficulty of the tracking problem increases, so does the need to design systems capable of learning increasingly useful and robust features. In this sense, many solutions consist of models that extract features expressing increasingly abstract concepts, which have the potential for greater generalization. Therefore, a lot of effort is directed toward identifying object features that are higher-level, more abstract representations of how the object fits within the overall context of the acquired video sequence. One example of such a concept is the previously-mentioned "affinity"; another is "attention", where some authors propose neural-network-based solutions for estimating attention and generating attention maps. Reference [15] computes attention features that are spatially and temporally sound using an arrangement of ROI identification and pooling operations. Reference [38] uses attention cues to handle the inherent noise from conventional detection methods, as well as to compensate for frequent interactions and overlaps among tracked targets. A two-component system handles noise and occlusion, and produces spatial attention maps by matching similar regions from pair frames. Temporal coherence is achieved by weighing observations across the trajectory differently, thereby assigning them different levels of attention. This process results in filtering criteria used to successfully account for similar observations while eliminating dissimilar ones. Another noteworthy contribution is [39], where attention maps are generated using reciprocative learning. The input frame is sent back-and-forth through several convolutional layers: in the forward propagation phase classification scores are generated, while the back-propagation produces attention maps from the gradients of the previously-obtained scores. The computed maps are further used as regularization terms within a classifier. The advantage of this approach is its simplicity compared to other similar ones. The authors claim that their method for generating attention features ensures long-term robustness. Other methods that use frame pairs and no recurrent components do not seem to work as well for very long-term sequences. Recently, attention mechanisms have been gaining significant ground for solving temporal consistency problems, since they allow the underlying model the freedom to weigh selective portions of a time-based sequence. Other noteworthy examples of works where attention mechanisms are incorporated into a CNN-based detector are [40,41].

#### 2.1.4. LSTM-Based Methods

Generally, methods that are based on non-recurrent CNN-only approaches are best suited to handle short scenes where quick reactions are required in a brief situation that can be captured in a limited number of frames. Various literature studies show that LSTM-based methods have more potential to ensure the proper handling of long-term dependencies while avoiding various mathematical pitfalls. One example in this sense is the "vanishing gradient" problem, which in practice manifests as a mis-trained network resulting in drift effects and false positives. Furthermore, handling long-term dependencies means having to deal with occlusions to a greater extent than in shorter term scenarios.

Most approaches combine various classifiers that handle spatial and shape-based classification with LSTM components that deal with temporal coherence. An early example of an RNN implementation is [42], which uses an LSTM-based classifier to track objects in time, across multiple frames (Figure 1 in [42]). The authors demonstrate that an LSTM-based approach is better suited to removing and reinserting candidate observations to account for objects that leave/reenter the visible area of the scene. This provides a solution to the track initiation and termination problem based on data associations found in features obtained from the LSTM layers. This concept is exploited further by [43], where various cues are determined to assess long-term dependencies using a dual LSTM network. One LSTM component tracks motion, while the other handles interactions, and the two are combined to compute similarity scores between frames. The results show that using recurrent components to handle lengthy sequences produces more reliable results than other methods based on frame pairs. Some implementations using LSTM layers focus on tracking-while-driving problems, which pose additional challenges compared to most established benchmarks using static cameras. As an alternative to solutions that involve creating models of vehicle behavior, Reference [44] circumvents the need for vehicle modeling by directly inputting sensor measurements into an LSTM network to predict future vehicle positions and to analyze temporal behavior. A more elaborate attempt is [45], where instead of raw sensor data, the authors establish several maneuver classes and feed maneuver sequences to LSTM layers in order to generate probabilities for the occurrence of future maneuver instances. Eventually, multiple such maneuvers can be used to construct the trajectory and/or anticipate the intentions of the vehicles.

Furthermore, increasing the length of the sequence increases accuracy and stability over time, up to a certain limit where the network saturates and no longer improves. A solution to this problem would be to split the features into multiple sub-features, followed by reconnecting them to form more coherent long-term trajectories. This is achieved in [46] where a combined CNN and RNN-based feature extractor generates tracklets over lengthy sequences. The tracklets are split on frames that contain occlusions. A recombination mechanism based on gated recurrent units (GRUs) recombines the tracklet pieces according to their similarities, followed by the reconstruction of the complete trajectory using polynomial curve fitting.
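The final reconstruction step mentioned above, polynomial curve fitting over recombined tracklet pieces, can be illustrated with a minimal sketch. It assumes that two tracklet pieces have already been matched (the GRU-based recombination of [46] is not reproduced here), that each image coordinate is fitted independently as a function of the frame index, and that a second-degree polynomial is adequate. The function names `polyfit` and `evaluate` and the sample values are hypothetical.

```python
def polyfit(ts, ys, degree=2):
    """Least-squares polynomial coefficients c[0] + c[1]*t + ... (normal equations)."""
    n = degree + 1
    # Normal equations A c = b, with A[i][j] = sum(t^(i+j)) and b[i] = sum(y * t^i)
    A = [[sum(t ** (i + j) for t in ts) for j in range(n)] for i in range(n)]
    b = [float(sum(y * t ** i for t, y in zip(ts, ys))) for i in range(n)]
    # Gaussian elimination with partial pivoting
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        b[col], b[pivot] = b[pivot], b[col]
        for r in range(col + 1, n):
            f = A[r][col] / A[col][col]
            for c in range(col, n):
                A[r][c] -= f * A[col][c]
            b[r] -= f * b[col]
    # Back substitution on the upper-triangular system
    coeffs = [0.0] * n
    for i in reversed(range(n)):
        s = sum(A[i][j] * coeffs[j] for j in range(i + 1, n))
        coeffs[i] = (b[i] - s) / A[i][i]
    return coeffs


def evaluate(coeffs, t):
    """Value of the fitted polynomial at frame index t."""
    return sum(c * t ** k for k, c in enumerate(coeffs))


# Usage: two tracklet pieces (frames 0-2 and 6-8) separated by an occlusion gap.
frames = [0, 1, 2, 6, 7, 8]
x_positions = [2.0, 3.5, 6.0, 26.0, 33.5, 42.0]  # hypothetical x coordinates
x_coeffs = polyfit(frames, x_positions, degree=2)
x_at_gap = evaluate(x_coeffs, 4)  # estimated x position inside the gap
```

The same fit would be repeated for the y coordinate. A higher degree can follow more complex motion, at the cost of less stable estimates inside long gaps.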

Some authors do further modifications to LSTM layers to produce classifiers that generate abstract high-level features, such as those found in appearance models. A good example in this sense is [47] where LSTM layers are modified to do multiplication operations and use customized gating schemes between the recurrent hidden state and the derived features. The newly-obtained LSTM layers are better at producing appearance-related features than conventional LSTMs, which excel at motion prediction. Where trajectory estimation is concerned, LSTM-based methods exploit the gating that takes place in the recurrent layers, as opposed to regular RNNs, which pass candidate features into the next recurrent iteration without discriminating between them. The filters inherently present in gated LSTMs have the potential to eliminate unwanted feature candidates which may represent unwanted trajectory paths. Candidates which eventually lead to correctly-estimated motion cues are maintained. Furthermore, LSTMs demonstrate an inherent capability to predict trajectories that are interrupted by occlusion events or by reduced acquisition capabilities. This idea is exploited in order to find solutions to the problem of estimating the layout of a full environment from limited sensor data, a concept referred to in the related literature as "seeing beyond seeing" [48]. Given a set of sensors with limited capability, the idea is to perform end-to-end tracking using raw sensor data without the need to explicitly identify high-level features or to have a pre-existing detailed model of the environment. In this sense, recurrent architectures have the potential to predict and reconstruct occluded parts of a particular scene from incomplete or partial raw sensor output. The network is trained with partial data and it is updated through a mapping mechanism that makes associations with an unoccluded scene. 
Subsequently, the recurrent layers make their own internal associations and become capable of filling in the missing gaps that the sensors have been unable to acquire. Specifically, given a hidden state of the world that is not directly captured by any sensor, an RNN is trained using sequences of partial observations in an attempt to update its belief concerning the hidden parts of the world. The resulting information is used to "unocclude" the scene that was initially only partially perceived through limited sensor data. Upon training, the network is capable of defining its own interpretation of the hidden state of the scene. The previously-mentioned result is elaborated upon by a group that includes the same authors [49]. A similar approach previously applied in basic robot guidance is extended for use in assisted driving. In this case, more complex information can be inferred from raw sensor input, in the form of occupancy maps. Together with a deep network-based architecture, these allow for predicting the probabilities of obstacle presence even in occluded portions within the field of view. In [50], the idea of using LSTM layers to process sensor data is depicted by modeling actor trajectories and activities based on the output of an arrangement of inertial sensors. The proposed neural network learns correlations among sensor outputs and consequently forms an inertial odometry model.
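The gating that distinguishes LSTM layers from regular RNNs, as discussed above for trajectory estimation, can be made concrete with a scalar sketch of the standard LSTM cell update. This is a generic textbook formulation, not the modified layers of [47]; the weight layout (one `(w_x, w_h, bias)` triple per gate) and the numeric values are assumptions chosen for illustration.

```python
from math import exp, tanh


def sigmoid(v):
    return 1.0 / (1.0 + exp(-v))


def lstm_step(x, h, c, w):
    """One step of a scalar LSTM cell.

    x: current input, h: previous hidden state, c: previous cell state,
    w: dict mapping a gate name to a (w_x, w_h, bias) triple.
    """
    def pre(name):
        w_x, w_h, bias = w[name]
        return w_x * x + w_h * h + bias

    f = sigmoid(pre("forget"))     # how much of the old cell state to keep
    i = sigmoid(pre("input"))      # how much of the new candidate to admit
    o = sigmoid(pre("output"))     # how much of the cell state to expose
    g = tanh(pre("candidate"))     # candidate feature
    c_new = f * c + i * g
    h_new = o * tanh(c_new)
    return h_new, c_new


# Usage: a closed input gate rejects the candidate, so the cell state is preserved.
weights = {
    "forget": (0.0, 0.0, 20.0),     # gate almost fully open
    "input": (0.0, 0.0, -20.0),     # gate almost fully closed
    "output": (0.0, 0.0, 20.0),
    "candidate": (1.0, 0.0, 0.0),
}
h_out, c_out = lstm_step(x=5.0, h=0.0, c=0.7, w=weights)
```

With the input gate closed, the large input value does not alter the stored state. A plain RNN has no such gate and would pass the candidate on unconditionally.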

In more recent studies, authors tend to add supplementary processing stages to their LSTM-based models. This additional effort seems to stem from the need to generate and incorporate an increasingly-refined and abstract array of features into the tracking process. As tracking scenarios increase in complexity, the resulting problem space increases in size and dimensionality. This motivates the need for extending an LSTM-centered model by incorporating it into a broader system. An example of such an approach is [51], where sequential dependencies are handled by LSTM layers as is commonly the case in such works. While relying on convolutional feature maps in the initial phases of the tracking pipeline, there is an additional mechanism for preparing selection proposals for the LSTM layers to process. Additionally, the common problem of feature inadequacy and class imbalance in the learning phase is handled by a GAN-based stage where the candidate samples are augmented. It is worth mentioning that, as more and more layers of different types are added to such a system, the reliability of the selected features may increase together with robustness to potential biases in the training data. However, at the same time, there is the risk that training and validating such a system may become a tedious, time-consuming task. As the complexity increases, so does the need for extending the training data set and supplementing the required computational resources. Other efforts in the direction of producing more usable features involve determining pedestrian intention as suggested by [52]. In this scenario, an LSTM model is used in conjunction with an intention filter to select suitable trajectory offset hypotheses so as to add to the reliability of the predicted result. The use of intention as a defining concept for features is also explored by [53], who enhance LSTM cells by introducing additional speed and correlation components. These components serve to model the more complex interactions required to define intention. In [54], instead of refining feature candidates using additional mechanisms, the authors choose to change the representation of the respective features. Specifically, LSTM layers are reconfigured and repurposed to handle multidimensional hidden states as opposed to the 1D vectors used traditionally. This increases the ability of such layers to accurately model spatial and temporal interactions among pedestrians. Conversely, in [55] the authors adapt an arrangement of LSTM layers to process sparse 3D data structures as opposed to changing internal data representation. The authors choose to model the interactions among pedestrians using graphs and, consequently, graph convolutional networks. Such systems still rely on LSTM layers to encode temporal dependencies. The spatial and sequence-related relationships among the tracked actors (represented as nodes) are modeled by determining connections in the form of graph edges [56,57].

#### 2.1.5. Miscellaneous Neural Network-Based Methods

An interesting alternative to conventional deep learning architectures is the use of GANs, as demonstrated in [58]. GANs train generative models and filter their results using a discriminative component. GANs are notoriously difficult to train, which is one of the reasons why they are seldom used in the related literature. In terms of tracking, GANs alleviate the need to compute expensive appearance features and minimize the fragmentation that typically occurs in more conventional trajectory prediction models. A generative component produces and updates candidate observations, of which the least updated are eliminated. The generative-discriminative model is used in conjunction with an LSTM component to process and classify candidate sequences. This approach has the potential to produce high-accuracy models of human behavior, especially group behavior. At the same time, it is significantly more lightweight than previously-considered CNN-based solutions.

Another "outlier" solution in the related literature is [59], one of the few efforts involving reinforcement learning for MOT applications. The proposed model is split into two parts: a predictive component based on a CNN, which treats pedestrian detections as agents and determines the displacement of a target agent from its initial location; a decision network, which uses the resulting predictions and detections within a deep reinforcement learning network where the actions among the agents and their environment are rewarded so as to maximize their shared utility. Consequently, the collaborative interactions of multiple agents are exploited in order to simultaneously detect and track them more effectively. Other driver-centric reinforcement learning-based solutions determine driving rules for collision avoidance by weighing vehicle paths against potential pedestrian trajectories [60].

#### *2.2. Other Techniques*

While the current state-of-the-art methods for MOT are mostly neural network-based, there also exist a multitude of other approaches which exploit more traditional, unsupervised means of providing reliable tracking. Neural networks gained popularity in recent years due in no small part to the availability of more powerful hardware, particularly GPUs, which allowed for training models capable of handling realistic scenarios in a reasonable amount of time. Neural networks, however, have the downside of needing vast amounts of reliable training data. Also, they require a lot of experimentation and trial-and-error before the right design and hyperparameter set is found for a particular scenario. There are, however, situations where training data may not be readily available in sufficient quantity and variety. Such cases call for a more straightforward design and a more intuitive model that can provide reliable tracking without necessarily requiring supervised learning. Neural network models are harder to understand in terms of how they function, and, while as deterministic as their non-neural network-based counterparts, are less intuitive and meant for use in a "black-box" manner. This is where other, more transparent methods come into play.

The tracking problem can be formulated similarly to the neural-network case: given a set of observations/appearances/segmented objects in multiple video frames, the task is to develop a means of determining relationships among these elements across the frames and to come up with a means of predicting their path. Various authors formulate this problem differently, for instance some methods involve determining tracklets in each frame and then assembling object trajectories in a full video sequence by combining tracklets from all or some of the frames [61]. Traditional, non-NN-based approaches, especially nonsupervised ones, generally formulate much more straightforward models. Some are based on a graph or flow-oriented interpretation of the tracked scene. Others rely on emitting hypotheses as to the potential trajectories of the tracked targets, or otherwise formulating some probabilistic approach to predicting the evolution of objects in time. It is worth noting that many of the more conventional, unsupervised algorithms from the related literature do not generalize the solution as well as a NN-based method. Consequently, they are usable in a limited number of scenarios, by comparison. Some works attempt to circumvent this problem using evolutionary algorithms as multicriteria optimization methods [62]. However, while capable of covering a significant portion of the problem space, such methods have the downside that the optimal trajectory needs to be periodically recalculated, which can hinder performance especially for on-board-only systems. Also, methods that attempt to account for temporal consistency do not handle time sequences as lengthy as, for instance, an LSTM network. The likely explanation is that an unsupervised method requires far more processing capabilities the more frame elements it is fed. 
In the case of a properly-trained neural network, the amount of computational resources required does not increase as much with the length of the associated sequence. However, in practice, especially on an embedded device as required in automotive tracking, porting a more conventional method may be more convenient in terms of implementation and platform compatibility than running a pre-trained NN model.

Another important aspect worth mentioning is that conventional methods are much more varied in terms of their underlying algorithms, as opposed to an NN-based architecture which features various arrangements of the same two or three neural network types, with additional processing of layer activations or outputs as the case may be. For this reason, we do not attempt to cover all the approaches ever developed for object tracking, but we rather focus on representative works featuring various successful attempts at MOT.

#### 2.2.1. Traditional Algorithms and Methods Focusing on High-Performance

The Kalman filter is a popular method with many applications in navigation and control, particularly with regard to predicting the future path of an object and associating multiple objects with their trajectories, while demonstrating significant robustness to noise. Generally, Kalman-based methods are used for simpler tracking, particularly in online scenarios where the tracker only accesses a limited number of frames at a time, possibly only the current and previous ones. An example of the use of the Kalman filter is [63], where a combination of the aforementioned filter and the Munkres algorithm as the min-cost estimator is used in a simple setup focusing on performance. The method requires designing a dynamic model of the tracked objects' motion, and is much more sensitive to the type of detector. However, once the proper parameters are established, the simplicity of the method allows for significant real-time performance.
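To make the role of the dynamic model concrete, the following is a minimal sketch of one predict/update cycle of a Kalman filter for a single coordinate. It assumes a constant-velocity motion model, position-only measurements, and diagonal process noise; these choices, the noise levels `q` and `r`, and the function name `kalman_step` are illustrative assumptions and are not taken from [63].

```python
def kalman_step(x, P, z, dt=1.0, q=1e-2, r=1.0):
    """One predict/update cycle of a 1D constant-velocity Kalman filter.

    x: [position, velocity], P: 2x2 covariance as nested lists,
    z: measured position, q: process noise level, r: measurement noise variance.
    """
    # Predict the state: x' = F x, with F = [[1, dt], [0, 1]]
    pos = x[0] + dt * x[1]
    vel = x[1]
    # Predict the covariance: P' = F P F^T + Q, with Q = q * I (assumed)
    p00 = P[0][0] + dt * (P[0][1] + P[1][0]) + dt * dt * P[1][1] + q
    p01 = P[0][1] + dt * P[1][1]
    p10 = P[1][0] + dt * P[1][1]
    p11 = P[1][1] + q
    # Update with H = [1, 0]: only the position is observed
    s = p00 + r                  # innovation covariance
    k0, k1 = p00 / s, p10 / s    # Kalman gain
    y = z - pos                  # innovation (measurement residual)
    x_new = [pos + k0 * y, vel + k1 * y]
    P_new = [[(1 - k0) * p00, (1 - k0) * p01],
             [p10 - k1 * p00, p11 - k1 * p01]]
    return x_new, P_new


# Usage: noise-free positions of a target moving at 2 units per frame.
state, cov = [0.0, 0.0], [[10.0, 0.0], [0.0, 10.0]]
for k in range(50):
    state, cov = kalman_step(state, cov, z=2.0 * k)
```

After the loop, the velocity component of `state` should be close to the true value of 2, even though velocity is never measured directly. In a multi-object tracker, one such filter per track and per coordinate would supply the predicted positions used for data association.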

Similar methods are frequently used in simple scenarios where a limited number of frames are available and the detections are accurate. In such situations, the simplicity of the implementations allows for quick response times even on low-spec embedded client devices. In the same spirit of providing an easy, straightforward method that works well for simple scenarios, Reference [64] provides an approach based on bounding-box regression. Given multiple object bounding boxes from a set of ordered frames, the authors use a regression model to predict the positions of the objects' bounding boxes in following frames. An important restriction of such an approach is that it only successfully detects targets that move only slightly across consecutive frames, making it reliable in scenarios where the frame rate is high enough and relatively stable. Furthermore, a reliable detector is a must in such situations, and crowded scenes with frequent occlusion events are not handled properly. As with the previous approach, this is well suited for easy cases where robust image acquisition is available and performance and implementation simplicity are a priority. Unfortunately, noisy images are fairly common in automotive scenarios where, for efficiency and cost reasons, a compromise may be made in terms of the quality and performance of the cameras and sensors. It is often desirable that the software be robust to noise so as to minimize the hardware costs.

In Reference [65], tracking is done by a particle filter for each track. The authors use the Munkres assignment for bounding boxes within consecutive images for each track. A cost matrix is then generated based on the associations made among bounding boxes from current and previous images. Specifically, the cost of associating two bounding boxes is determined from the Euclidean distance between the centers of the boxes, as well as their size variation. This approach is simple to implement, but the assignment algorithm has an *O*(*n*³) complexity, which is likely too high for real-time tracking.
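The cost construction described above can be sketched as follows. The exact way in which [65] combines center distance and size variation is not given here, so the weighted sum and the `size_weight` parameter are assumptions. For brevity, the optimal assignment is found by exhaustive search over permutations, which is only practical for a handful of boxes; the Munkres algorithm yields the same optimum in polynomial time.

```python
from itertools import permutations
from math import hypot


def association_cost(a, b, size_weight=1.0):
    """Cost of linking two boxes given as (x, y, w, h), with (x, y) the top-left corner."""
    center_a = (a[0] + a[2] / 2, a[1] + a[3] / 2)
    center_b = (b[0] + b[2] / 2, b[1] + b[3] / 2)
    center_dist = hypot(center_a[0] - center_b[0], center_a[1] - center_b[1])
    size_change = abs(a[2] - b[2]) + abs(a[3] - b[3])
    return center_dist + size_weight * size_change


def best_assignment(prev_boxes, curr_boxes):
    """Minimum-cost one-to-one assignment; assumes len(prev_boxes) <= len(curr_boxes)."""
    cost = [[association_cost(p, c) for c in curr_boxes] for p in prev_boxes]
    best, best_total = None, float("inf")
    # Brute-force stand-in for the Munkres algorithm (small inputs only)
    for perm in permutations(range(len(curr_boxes)), len(prev_boxes)):
        total = sum(cost[i][j] for i, j in enumerate(perm))
        if total < best_total:
            best, best_total = perm, total
    return list(best), best_total


# Usage: the detector reports the two objects in swapped order in the new frame.
previous = [(0, 0, 10, 10), (50, 50, 10, 10)]
current = [(52, 51, 10, 10), (1, 1, 10, 10)]
match, total_cost = best_assignment(previous, current)
```

Here `match[i]` is the index of the current box assigned to previous box `i`.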

Various attempts exist for improving noise robustness while maintaining performance, for example in [66]. In this case, the lifetime of tracked objects is modeled using a Markov Decision Process (MDP). The policy of the MDP is determined using reinforcement learning, whose objective is to learn a similarity function for associating tracked objects. The positions and lifetimes of the objects are modeled using transitions between MDP states. Reference [67] also uses MDPs in a more generalized scheme, involving multiple sensors and cameras and fusing the results from multiple MDP formulations. Note that Markov models can be limiting when it comes to automotive tracking, since a typical scene with multiple interacting targets does not exhibit the Markov property where the current state only depends on the previous one. In this regard, the related literature features multiple attempts to improve reliability. Reference [68] proposes an elaborate pipeline featuring multiview tracking, ground plane projection, maneuver recognition, and trajectory prediction. The method involves an assortment of approaches, which include Hidden Markov Models and Variational Gaussian mixture models. Such efforts show that an improvement over traditional algorithms involves sequencing together multiple different methods, each with its own role. As such, there is the risk that the overall resulting approach may be too fragmented and too cumbersome to implement, interpret, and improve properly.

Works such as [69] attempt to circumvent such limitations by proposing alternatives to tried-and-tested Markov models, in this case in the form of a system that determines behavioral patterns in an effort to ensure global consistency for tracking results. There are multiple ways to exploit behavior in order to guide the tracking process. For instance, a possible solution would rely on learning and minimizing/maximizing an energy function that associates behavioral patterns with potential trajectory candidates. This concept is exemplified by [70], who propose a method based on minimizing a continuous energy function aimed at handling the very large space of potential trajectory solutions. A limited, discrete set of behavior patterns imposes limitations on the energy function. While such a limitation offers better guarantees that a global optimum will eventually be reached, it may not allow for a complete representation of the system.

An alternative approach which is also designed to handle occlusions is [71], where the divide-and-conquer paradigm is used to partition the solution space into smaller subsets, thereby optimizing the search for the optimal variant. The authors note that while detections and their respective trajectories can be extracted rather efficiently from crowded scenes, the presence of ambiguities induced by occlusion events may raise significant detection errors. The proposed solution involves subdividing the object assignment problem into subproblems, followed by a selective combination of the best features found within the subdivisions (Figure 3 in [71]). The number and types of the features are variable, thereby accounting for some level of flexibility for this approach. One particular downside is that once the scene changes, the problem itself also changes and the subdivisions need to reoccur and update, therefore making this method unsuitable for scenes acquired from moving cameras.

A similar problem is posed in [61], where it is also noted that complex scenes pose tracking difficulties due to occlusion events and similarities among different objects. This issue is handled by subdividing object trajectories into multiple tracklets and subsequently determining a confidence level for each such tracklet, based on its detectability and continuity. Actual trajectories are then formed from tracklets connected based on their confidence values. One advantage of this method in terms of performance is that tracklets can be added to already-determined trajectories in real-time as they become available without requiring complex processing or additional associations. Additionally, linear discriminant analysis is used to differentiate objects based on appearance criteria. The concept of appearance is more extensively exploited by [72], who use motion dynamics to distinguish between targets with similar features. They approach the problem by determining a dynamics-based similarity between tracklets using generalized linear assignment. As such, targets are identified using motion cues, which are complementary to more well established appearance models. While demonstrating adequate performance and accuracy, it is worth mentioning that motion-based features are sensitive to camera movement and are considerably more difficult to use in automotive situations. Motion assessment metrics that work well for static cameras may be less reliable when the cameras are in motion and image jittering and shaking occur.

The idea of generating appearance models using traditional means is exemplified in [73], who use a combination of appearance models learned using a regularized least squares framework and a system for generating potential solution candidates in the form of a set of track hypotheses for each successful detection. The hypotheses are arranged in trees, each of which is scored and selected according to the best fit in terms of providing usable trajectories. An alternative to constructing an elaborate appearance model is proposed by [74], who directly involve the shape and geometry of the detections within the tracking process, therefore using shape-based cost functions instead of ones based on pixel clusters. Furthermore, results focusing on tracking-while-driving problems may opt for a vehicle behavior model, or a kinematic model, as opposed to one that is based on appearance criteria. Examples of such approaches are [75–77], where the authors build models of vehicle behavior from parameters such as steering angles, headings, offset distances, and relative positions. Note that kinematic and motion models are generally more suited to situations where the input consists of data from radar, Light Detection and Ranging (LiDAR) or Global Positioning Systems (GPS), as opposed to image sequences. In particular, attempting to reconstruct visual information from LiDAR point clouds is not a trivial task and may involve elaborate reconstruction, segmentation and registration preprocessing before a suitable detection and tracking pipeline can be designed [78].

Another class of results from related literature follows a different paradigm. Instead of employing complex energy minimization functions and/or statistical modeling, other authors opt for a simpler, faster approach that works with a limited amount of information drawn from the video frames. The motivation is that in some cases the scenarios may be simple enough that a straightforward method that alleviates the need for extended processing may prove just as effective as more complex and elaborate counterparts. An example in this direction is [79], whose method is based on scoring detections by determining overlaps between their bounding boxes across multiple consecutive frames. A scoring system is then developed based on these overlaps and, depending on the resulting scores, trajectories are formed from sets of successive overlaps of the same bounding boxes. Such a method does not directly handle crowded scenes, occlusions or fast-moving objects whose positions are far apart in consecutive frames; however, it may present a suitable compromise in terms of accuracy in scenarios where performance is critical and the embedded hardware may not allow for more complex processing. This is in contrast to high-performance methods that use on-board hardware to provide a lot of the information required for tracking, therefore reducing the reaction time of the underlying system in high-speed scenarios [80]. An additional important consideration for this type of problem is how the tracking method is evaluated.
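The overlap-based linking described above can be sketched as a greedy tracker that extends each trajectory with the detection of highest intersection-over-union (IoU) in the next frame. This is a simplified illustration under assumed conventions (boxes as corner coordinates, a fixed `min_iou` threshold, no detection scores), not a reproduction of the scoring system in [79].

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def track_by_overlap(frames, min_iou=0.5):
    """Link detections across frames; returns tracks as lists of (frame_index, box)."""
    active, finished = [], []
    for t, detections in enumerate(frames):
        unmatched = list(detections)
        still_active = []
        for track in active:
            last_box = track[-1][1]
            best = max(unmatched, key=lambda d: iou(last_box, d), default=None)
            if best is not None and iou(last_box, best) >= min_iou:
                track.append((t, best))
                unmatched.remove(best)
                still_active.append(track)
            else:
                finished.append(track)  # no sufficient overlap: the track ends
        # Every detection left unmatched starts a new track
        still_active.extend([(t, d)] for d in unmatched)
        active = still_active
    return finished + active


# Usage: two objects drift slowly; the second one disappears in the last frame.
frames = [
    [(0, 0, 10, 10), (100, 100, 110, 110)],
    [(1, 0, 11, 10), (101, 100, 111, 110)],
    [(2, 0, 12, 10)],
]
tracks = track_by_overlap(frames)
```

The sketch makes the limitation noted above visible: an object that moves by more than roughly half its width between frames falls below the threshold and is split into separate tracks.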

Most authors use a common, established set of benchmarks which, while having a certain degree of generality, cannot cover every situation that a vehicle might be found in. As such, some authors such as [81] devote their work to developing performance and evaluation metrics and data sets, which allow for covering a wide range of potential problems that may arise in MOT scenarios. As such, the choice in the method used for tracking is as much a consequence of the diversity of situations and events claimed to be covered by the method, as it results from the evaluation performed by the authors. For example, as was the case for NN-based methods, most evaluations are done for scenes with static cameras, which are only partly relevant for automotive applications. The advantage of the methods presented thus far lies in the fact that they generally outperform their counterparts in terms of the required processing power and computational resources, which is a plus for vehicle-based tracking where the client device is usually a low-power solution. Furthermore, some methods can be extended rather easily, as the need may be, for instance by incorporating additional features or criteria when assembling trajectories from individual detections, by finding an optimizer that ensures additional robustness, or, as is already the case with some of the previously-mentioned papers, by incorporating a light-weight supervised classifier in order to boost detection and tracking accuracy. Additionally, the problem of false or malicious information from other traffic participants (for example in a multi-vehicle situation) has the potential to affect the accuracy of such methods. One proposed solution to this issue is to cluster the observations drawn from cooperative tracking according to their reliability and their potential to adversely affect the tracking results [82].

#### 2.2.2. Methods Based on Graphs and Flow Models

A significant number of results from the related literature present the tracking solution as a graph search problem or otherwise model the tracking scene using a dependency graph or flow model. There are multiple advantages to using such an approach. Graph-based models are well suited to the multi-object tracking problem since, like a graph, the problem consists of inter-related elements, each with a distinct set of parameter values. The relationships that can be determined among tracked objects or a set of trajectory candidates can be modeled using edges with edge costs. Graph theory is well understood, and graph traversal and search algorithms are widely available, with implementations on most platforms. Likewise, flow models can be seen as an alternative interpretation of graphs, with node dependencies modeled through operators and dependency functions, forming an interconnected system. Unlike a traditional graph, data from a flow model progresses in an established direction that starts from initial components where acquired data are handled as input; the data then traverse intermediate nodes where they are processed in some manner and end up at terminal nodes where the results are obtained and exploited. Like graphs, flow models allow for loops that implement refinement techniques and in-depth processing via multiple local iterations.

Most methods that exploit graphs and flow models attempt to solve the tracking problem using a minimum path or minimum cost-type approach. An example in this sense is [83], where multi-object tracking is modeled using a network flow model subjected to min-cost optimization. Each path through the flow model represents a potential trajectory, formed by concatenating individual detections from each frame. Occlusion events are modeled as multiple potential directions arising from the occlusion node and the proposed solution handles the resulting ambiguities by incorporating pairwise costs into the flow network.
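To make the minimum-cost idea concrete, the following is a minimal sketch under strong simplifying assumptions: a single object, one detection chosen per frame, no occlusion handling, and a transition cost equal to the squared distance between consecutive detection centers. It is not the network flow formulation of [83]; the function name `min_cost_path` and the cost definition are illustrative choices. Detections form a layered graph (one layer per frame), and dynamic programming finds the cheapest path through it.

```python
import numpy as np

def min_cost_path(frames):
    """Pick one detection per frame so that total squared displacement is minimal.

    frames: list of (N_k, 2) arrays with detection centers, one array per frame.
    Returns (path, cost), where path[k] is the chosen detection index in frame k.
    """
    frames = [np.asarray(f, dtype=float) for f in frames]
    # cost[j]: cheapest cost of any path ending at detection j of the current frame
    cost = np.zeros(len(frames[0]))
    back = []  # back[k][j]: best predecessor in frame k of detection j in frame k+1
    for prev, cur in zip(frames[:-1], frames[1:]):
        # Pairwise transition costs, shape (N_prev, N_cur)
        step = ((prev[:, None, :] - cur[None, :, :]) ** 2).sum(axis=2)
        total = cost[:, None] + step
        back.append(total.argmin(axis=0))
        cost = total.min(axis=0)
    # Backtrack from the cheapest terminal detection
    j = int(cost.argmin())
    path = [j]
    for pred in reversed(back):
        j = int(pred[j])
        path.append(j)
    return path[::-1], float(cost.min())
```

Extending this sketch to several objects is where min-cost flow or k-shortest paths become necessary, since the paths must then be chosen jointly and may not share detections.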

A more straightforward solution is presented by [84], who solve multi-tracking using dynamic programming and formulate the scenario as a linear program. They subsequently handle the large number of resulting variables and constraints using k-shortest paths. One advantage of this method seems to be that it allows for reliable tracking from only four overlapping low-resolution, low-frame-rate video streams, which is in line with the cost-effectiveness required by automotive applications.

Another related solution is [85], where a cost function is developed from estimating the number of potential trajectories as well as their origins and end frames. Then, the scenario is handled as a shortest-path problem in a graph, which the authors solve using a greedy algorithm. This approach has the advantage that it uses well-established methods, therefore affording some level of simplicity to understanding and implementing the algorithms.

In [86], a similar graph-based solution divides the problem into multiple subproblems by exploring several graph partitioning mechanisms and uses greedy search based on Adaptive Label Iterative Conditional Modes. Partitioning allows for successful disassociation of object identities in circumstances where said identities might be confused with one another. Also, methods based on solution space partitioning have the advantage of being highly scalable, therefore allowing fine tuning of their parameters in order to achieve a trade-off between accuracy and performance. Multiple extensions of the graph-based problem exist in the related literature, for instance, when multiple other criteria are incorporated into the search method. The authors of [87] incorporate appearance and motion-based cues into their data association mechanism, which is modeled using a global graph representation and makes use of generalized minimum clique graphs to locate representative tracklets in each frame. Among other advantages, this allows for a longer time span to be handled, albeit for each object individually.

Another related approach is provided in [88], where the solution consists in a collaborative model which makes use of a detector and multiple individual trackers, whose interdependencies are determined by finding associations with key samples from each detected region in the processed frames. These interdependencies are further exploited via a sample selection method to generate and update appearance models for each tracker.

As extensions of the more traditional graph-based models that use greedy algorithms to search for suitable candidate solutions and update the resulting models in subsequent processing steps, some authors handle the problem using hypergraphs. These extend the concept of classical graphs by generalizing the role of graph edges. In a conventional graph an edge joins two nodes, while in a hypergraph edges are sets of arbitrary combinations of nodes. Therefore an edge in a hypergraph can connect multiple nodes, instead of just two as in the traditional case. This structure has the potential to form more extensive and complete models using a single unified concept and to alleviate the need for costly solution space partitioning or subdivision mechanisms. One use of the hypergraph concept is provided by [89], who build a hypergraph-based model to generate meaningful data associations capable of handling the problem of targets with similar appearance and in close proximity to one another, a situation frequently encountered in crowded scenes. The hypergraph model allows for the formulation of higher-order relationships among various detections, which, as mentioned in previous sections, have the potential to ensure robustness against simple transformations, noise, and various other spatial and temporal inaccuracies. The method is based on grouping dense neighborhoods of tracklets hierarchically, forming multiple layers which enable more fine-grained descriptions of the relationships that exist in each such neighborhood. A related but much more recent result [90] is also based on the notion that hypergraphs allow for determining higher-order dependencies among tracklets, but in this case the parameters of the hypergraph edges are learned using a structural support vector machine (SSVM), as opposed to being determined empirically. Trajectories are established as a result of determining higher-order dependencies by rearranging the edges of the hypergraph so as to conform to several constraints and affinity criteria (Figure 1 in [90]). While demonstrating robustness to affine transforms and noise, such methods still cannot handle complex crowded scenes with multiple occlusions and, compared to previously mentioned methods, suffer some penalties in terms of performance, since updating the various parameters of hypergraph edges can be computationally costly.

#### *2.3. Discussion*

Most of the results from the available literature focus on generating abstract, high-level features of the observations found in the processed images, since, generally, the more abstract the feature, the more robust it should be to transformations, noise, drift, and other undesired artifacts and effects. Most authors rely on an arrangement of CNNs where each component has a distinct role in the system, such as learning appearance models, geometric and spatial patterns, or learning temporal dependencies. It is worth noting that a strictly CNN-based method needs substantial tweaking and careful parameter adjustment before it can accomplish the complex task of consistent detection in space and across multiple frames.

LSTM-based architectures seem to show more promising results for ensuring long-term temporal coherence, since this is what they were designed for, while also being simpler to implement and train. For the purposes of autonomous driving, an LSTM-based method shows promise, considering that training should happen offline and that a heavily optimized solution is needed to achieve a real-time response. Designing such a system also requires a fair amount of trial-and-error since currently there is no well-established manner to predict which network architecture is suited to a particular purpose.

One particularly promising direction for automotive tracking is the use of solutions that rely on limited sensor data and that are able to efficiently predict the surrounding environment without requiring a full representation or reconstruction of the scene. These approaches circumvent the need for lengthy video sequences, heavy image processing and the computation of complicated object features while being especially designed to handle occlusion and objects outside of the immediate field of view. As such, where automotive tracking is concerned, the available results from the state of the art seem to suggest that an effective solution would make use of partial data while being able to handle temporal correlations across lengthy sequences using an LSTM component.

Other, unsupervised approaches not reliant on neural networks offer a more straightforward model with an easier implementation. The downside often consists in the limited generalization that such systems are capable of. The features required for detection and stability often have to be manually established, as opposed to supervised methods that can learn meaningful features on their own. The choice of the most useful and reliable tracking model ultimately rests on many factors, among which we mention: the size and complexity of the problem, the availability of training data, the available computational resources, and, ultimately, the requirements in terms of accuracy, coherence, and stability.

We summarize our findings in Table 1, where we classify the works referenced in this Section according to the main method and the general approach used throughout. We choose to feature distinctive categories for solutions relying mainly on convolutional layers and on recurrent ones. Other methods are grouped into their own category. This choice is motivated by the fact that, as of yet, deep neural networks consistently show the most promise for the problems described throughout this Section. Many authors have found inventive and effective solutions to tracking problems using neural network-based models, since they offer the most robust features while being natively designed to solve focus-and-context problems in data sequences.


**Table 1.** Classification of tracking solutions from existing works.

#### **3. Trajectory Prediction Methods**

In order to navigate through complex traffic scenarios safely and efficiently, autonomous cars should be able to predict the way in which the other traffic participants will behave in the near future with a sufficient degree of accuracy. The prediction of their motion is especially difficult because there are usually multiple interacting agents in a scene. Also, driver behavior is multi-modal, e.g., in different situations, from a certain common past trajectory, several different future trajectories may emerge. An autonomous car must also find a balance between the safety of people involved (its own passengers and other human drivers, or pedestrians) and choosing an efficient speed to reach its destination, without any perturbations to existing traffic. Predicting the future state of its environment is particularly important when the autonomous vehicle should act proactively, e.g., when changing lanes, overtaking other traffic participants and managing intersections [45].

Other difficulties come from the requirement that such a system must be prepared to handle rare, exceptional situations. However, because of the great number of possibilities involved, it should take into account only a reasonable subset of possible future scene evolutions [92] and often, it is important to identify the most probable solution [93].

Reasoning about the intentions of other drivers is a particularly helpful ability. Trajectory prediction can be treated on two different levels of abstraction. On the higher level, one can identify the overall intentions regarding a discrete set of possible actions, e.g., changing a lane or moving left or right in an intersection. On the lower level, one can predict the actual continuous trajectories of the road users [94].

Trajectory prediction needs to be precise but also computationally efficient [95]. The latter requirement can be satisfied by recognizing some constraints that reduce the size of the problem space. For example, the current speed of a vehicle affects its stopping time or the allowed curvature of its future trajectory so as to maintain the stability of the vehicle. Even if each driver has his/her own driving style, it is assumed that traffic rules will be obeyed, at least to some extent, and this will constrain the set of possible future trajectories [93].

A recent white paper [96] states that a solution for the prediction and planning tasks of an autonomous car may consider a combination of the following properties:


Further, Reference [96] asserts that the self-driving car system should be prepared not only for the worst-case illegal behavior of the other traffic participants, but also for their worst-case legal behavior. The prediction system should be able to learn what the "reasonable" conduct of the other drivers may be in various circumstances. This may also depend on local conditions, such as different "driving cultures" in different countries.

#### *3.1. Problem Description*

To tackle the trajectory prediction task, one needs to have access to real-time data from sensors such as LiDAR, radar, or camera, and to a functioning system that allows detection and tracking of traffic participants in real time. Examples of pieces of information that describe a traffic participant are: the bounding box, position, velocity, acceleration, heading, and yaw rate, i.e., the rate of change of the heading angle. Mapping data of the area where the ego car is driving may also be needed, i.e., road and crosswalk locations, lane directions, and other relevant map-related information. Past and future positions are represented in an ego car-centric coordinate system. Also, one needs to model the static context with road and crosswalk polygons, as well as lane directions and boundaries [97]. An example of available information on which the prediction module can operate is presented in Figure 1.3 in [98].

More formally, prediction can be defined as reasoning about probable outcomes based on past observations [99]. Let $X_t^i$ be a vector with the spatial coordinates of agent $i$ at observation time $t$, with $t \in \{1, 2, \ldots, T_{obs}\}$, where $T_{obs}$ is the present time step in the series of observations. The past trajectory of agent $i$ is a sequence $X^i = \{X_1^i, X_2^i, \ldots, X_{T_{obs}}^i\}$. Based on the past trajectories of all agents, one needs to estimate the future trajectories of all agents, i.e., $\hat{Y}^i = \{\hat{Y}_{T_{obs}+1}^i, \hat{Y}_{T_{obs}+2}^i, \ldots, \hat{Y}_{T_{pred}}^i\}$.

It is also possible to first generate the trajectories in the Frenet coordinate system along the current lane of the ego vehicle, and then convert them to the Cartesian coordinate system [93]. The Frenet coordinate system is useful to simplify the motion equations when cars travel on curved roads. It consists of longitudinal and lateral axes, denoted as *s* and *d*, respectively. The curve that goes through the center of the road determines the *s* axis and indicates how far along the car is on the road. The *d* axis indicates the lateral displacement of the car. *d* is 0 on the center of the road and its absolute value increases with the distance from the center. Also, it can be positive or negative, depending on the side of the road.
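As an illustration of the conversion from Frenet to Cartesian coordinates, the sketch below assumes that the road center line is given as a polyline of distinct points ordered in the driving direction, and that positive *d* lies to the left of the driving direction. Both are conventions chosen here for the example, not prescribed by [93]; the function name `frenet_to_cartesian` is likewise hypothetical.

```python
import numpy as np

def frenet_to_cartesian(s, d, centerline):
    """Map Frenet coordinates (s, d) to Cartesian (x, y) along a polyline.

    centerline: (N, 2) array of distinct points ordered in the driving direction.
    Assumed convention: positive d is to the left of the driving direction.
    """
    pts = np.asarray(centerline, dtype=float)
    seg = np.diff(pts, axis=0)                          # segment vectors
    seg_len = np.linalg.norm(seg, axis=1)               # segment lengths
    cum = np.concatenate(([0.0], np.cumsum(seg_len)))   # arc length at each vertex
    s = float(np.clip(s, 0.0, cum[-1]))                 # stay on the polyline
    # Index of the segment that contains arc length s
    k = int(np.searchsorted(cum, s, side="right")) - 1
    k = min(k, len(seg) - 1)
    tangent = seg[k] / seg_len[k]
    normal = np.array([-tangent[1], tangent[0]])        # tangent rotated by +90 degrees
    base = pts[k] + (s - cum[k]) * tangent              # point on the center line
    return base + d * normal
```

A piecewise-linear center line makes the normal direction jump at the vertices; a smooth reference curve (for example, a spline) would avoid this, at the cost of a more involved arc-length computation.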

#### *3.2. Classification of Methods*

There are several classification approaches presented in the literature regarding trajectory planning methods.

An online tutorial [100] distinguishes the following categories:


A survey [101] proposes a different classification based on three increasingly abstract levels:


The higher the level of abstraction of a prediction model, the more computationally expensive the model tends to become. Therefore, algorithms have been proposed that focus only on the most plausible trajectories. Also, the performance of the prediction methods is highly coupled with risk estimation possibilities. Therefore, the authors of [101] consider that successful approaches in this area should consider both vehicle motion modeling and risk estimation.

A classification somewhat similar to the previous two is mentioned in [102], which distinguishes the following categories of motion prediction methods:


In the rest of this section, we present some specific approaches classified by their main prediction "paradigm", namely neural networks and other methods, most of which use some kind of stochastic representation of the agents' behavior in the environment. This is especially useful since some works use the same model to address different abstraction levels of the trajectory prediction task.

#### *3.3. Methods Using Neural Networks*

Many of the approaches presented in the literature that are based on neural networks use either recurrent neural networks (RNNs), which explicitly take into account a history composed of the past states of the agents, or simpler convolutional neural networks (CNNs). Other authors use conditional variational autoencoders (CVAEs) or more recent methods such as generative adversarial networks (GANs) and attention mechanisms.

A generative system is DESIRE [99], which has the goal of predicting the future locations of multiple interacting agents in dynamic (driving) scenes. It can handle the multi-modal nature of the prediction, i.e., for the same set of inputs, the predicted outputs may have several distinct values (a one-to-many mapping). It also takes into account the scene context and the interactions between traffic participants. It uses a single end-to-end neural network model, which the authors report to be computationally efficient. Using a deep learning framework, DESIRE can rank and refine the set of generated trajectories by considering the long-term future values, i.e., the sum of discounted rewards.

The corresponding optimization problem tries to maximize the potential future reward of the prediction, using the following mechanisms (Figure 2 in [99]):


In [103], a method to predict the trajectories of the neighboring traffic participants is proposed using a long short-term memory (LSTM) network, with the goal of taking into account the relationship between the ego car and surrounding vehicles. The LSTM is a type of recurrent neural network (RNN) capable of learning long-term dependencies. Generally, an RNN has a vanishing gradient problem. An LSTM is able to deal with this through a forget gate, designed to control the information between the memory cells in order to store the most relevant previous data. The proposed method considers the ego car and four surrounding vehicles. It is assumed that drivers generally pay attention to the relative distance and speed with respect to the other cars when they intend to change a lane. Based on this assumption, the relative quantities between the target and the four surrounding vehicles are used as the input of the LSTM network. The feature vector $\mathbf{x}_t$ at time step $t$ is defined by twelve features: lateral position of target vehicle, longitudinal position of target vehicle, lateral speed of target vehicle, longitudinal speed of target vehicle, relative distance between target and preceding vehicle, relative speed between target and preceding vehicle, relative distance between target and following vehicle, relative speed between target and following vehicle, relative distance between target and lead vehicle, relative speed between target and lead vehicle, relative distance between target and ego vehicle, and relative speed between target and ego vehicle. The input of the LSTM network is a sequence of the vectors $\mathbf{x}_t$ for past time steps. The output is the feature vector at the next time step $t + 1$. A trajectory is predicted by iteratively using the output result of the network as the input vector for the subsequent time step.
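The iterative use of the network output as the next input can be sketched as a generic rollout loop. In the example below, `predict_next` stands in for the trained LSTM of [103], which is not reproduced here; the `constant_velocity` function is a hypothetical placeholder predictor used only to make the loop runnable, and the fixed-length sliding window is an assumption of this sketch.

```python
import numpy as np

def rollout(predict_next, history, horizon):
    """Predict `horizon` future feature vectors by feeding outputs back as inputs.

    predict_next: callable mapping a (T, F) window of feature vectors to the
                  (F,) vector at the next time step (stand-in for the LSTM).
    history:      (T, F) array with the observed feature vectors.
    """
    window = np.asarray(history, dtype=float)
    predictions = []
    for _ in range(horizon):
        nxt = np.asarray(predict_next(window), dtype=float)
        predictions.append(nxt)
        # Slide the window: drop the oldest step, append the new prediction
        window = np.vstack([window[1:], nxt])
    return np.stack(predictions)

def constant_velocity(window):
    """Hypothetical placeholder: linear extrapolation from the last two steps."""
    return window[-1] + (window[-1] - window[-2])
```

Because each prediction becomes part of the next input, errors accumulate over the horizon, which is one reason why longer-term forecasts obtained this way tend to be less reliable than short-term ones.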

In [44], an efficient trajectory prediction framework is proposed, which is also based on an LSTM. This approach is data-driven and learns complex behaviors of the vehicles from a massive amount of trajectory data. The LSTM receives the coordinates and velocities of the surrounding vehicles as inputs and produces probabilistic information about the future positions of the traffic participants on an occupancy grid map (Figure 1 in [44]). The proposed method is reported to have better prediction accuracy than Kalman filtering.

The occupancy grid map is widely adopted for probabilistic localization and mapping. It reflects the uncertainty of the predicted trajectories. In [44], the occupancy grid map is constructed by partitioning the range under consideration into several grid cells. The grid size is determined such that a grid cell approximately covers a quarter of a lane to recognize the movement of the vehicles on the same lane, as well as the lengths of the vehicles (Figure 3 in [44]).

When predictions are needed for different time ranges, e.g., Δ = 0.5, 1, 2 s, the LSTM is trained independently for each time range. The LSTM produces the probability of occupancy for each grid cell. Let $(i_x, i_y)$ be the identifier of a cell in the occupancy grid. Then the softmax layer of the $i$-th LSTM computes the probability $P_o^{(i)}(i_x, i_y)$ for the grid element $(i_x, i_y)$.

Finally, the outputs of the $n$ LSTMs are combined using

$$P_o(i_x, i_y) = 1 - \prod_{i=1}^{n} \left(1 - P_o^{(i)}(i_x, i_y)\right).$$

The probability of occupancy $P_o(i_x, i_y)$ summarizes the prediction of the future trajectory for all $n$ vehicles in a single map.
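The combination rule states that a cell is occupied unless every vehicle leaves it free. A minimal NumPy sketch follows; the function names, the grid shapes, and the choice of applying the softmax over all cells of one vehicle's grid are assumptions made for the example.

```python
import numpy as np

def softmax_grid(logits):
    """Turn an (H, W) array of scores into probabilities that sum to 1 over all cells."""
    z = logits - logits.max()          # subtract the maximum for numerical stability
    e = np.exp(z)
    return e / e.sum()

def combine_occupancy(per_vehicle_grids):
    """Combine n per-vehicle occupancy grids into one map.

    per_vehicle_grids: array of shape (n, H, W), one probability grid per vehicle.
    Returns an (H, W) grid with 1 - prod_i (1 - P_i) computed cell by cell.
    """
    grids = np.asarray(per_vehicle_grids, dtype=float)
    return 1.0 - np.prod(1.0 - grids, axis=0)
```

For instance, if two vehicles each occupy the same cell with probability 0.5, the combined probability is 1 − 0.5 · 0.5 = 0.75, while a cell that only one vehicle may occupy keeps that vehicle's probability.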

Alternatively, the same LSTM architecture can be used to directly predict the coordinates of a vehicle as a regression task. Instead of using the softmax layer to compute probabilities, the system can produce two real coordinate values *x* and *y*.

In [45], another LSTM model is described for interaction-aware motion prediction. Confidence values are assigned to the maneuvers that are performed by vehicles. Based on them, a multi-modal distribution over future motions is computed. More specifically, the model computes probabilities for each type of maneuver, based on six maneuver classes. The input to the LSTM is represented by the past positions of the ego car and its neighbors, and the geometry of the road lanes.

Social LSTM [104], used for predicting the trajectory of pedestrians, uses LSTM with a social pooling layer which allows neighbors, up to a certain distance, to exchange information. The hidden states of their corresponding LSTMs are pooled together and used as an input for the following prediction step.

Taking into account the time constraints of a real-time system, Reference [97] uses a simple feed-forward CNN architecture for the prediction task. The authors use an RGB image to represent the scene context. However, a vector of velocity, acceleration, and yaw rate can also be included. In this case, this vector is concatenated with the flattened output of the CNN. Then, these aggregated features are sent to a fully connected layer.

A similar approach is used in [18], which predicts multiple possible trajectories together with their probabilities. The context is also encoded as an image that is passed to a CNN. Given the raster image and the state estimates of agents at a time step, the CNN is used to predict a multitude of possible future state sequences, as well as the probability of each sequence.

As part of a complete software stack for autonomous driving, NVIDIA created a system based on a CNN, called PilotNet [105], which outputs steering angles given images of the road ahead. This system is trained using road images paired with the steering angles generated by a human driving a car that collects data. The authors identified the elements of the road image that have the greatest effect on the steering decision. It seems that in addition to learning the obvious features such as lane markings, edges of roads and other cars, the system learns more subtle features that would be hard to anticipate and program by engineers, e.g., bushes lining the edge of the road and atypical vehicle classes, while ignoring structures in the camera images that are not relevant to driving. This capability is derived from data without the need of hand-crafted rules.

In [94], the authors propose a learnable end-to-end model with a deep neural network that reasons about both high level behavior and long-term trajectories. Inspired by how humans perform this task, the network exploits motion and prior knowledge about the road topology in the form of maps containing semantic elements such as lanes, intersections, and traffic lights. The so-called IntentNet is a CNN that outputs three types of variables in a single forward pass: the detection scores for vehicle and background classes, the action probabilities corresponding to the discrete intentions, and bounding box regressions in the current and future time steps representing the intended trajectory. This design enables the system to propagate uncertainty through the different components and is reported to be computationally efficient.

A sequence-to-sequence CNN architecture is also used in [95] for an end-to-end trajectory prediction model. The authors say that the results are comparable to those of other, more complex approaches using LSTMs. Trajectory histories are embedded by means of a fully connected layer. Stacked convolutional layers are used to learn temporal dependencies in a consistent manner. Then, the features from the final convolutional layer are passed through a fully-connected layer to simultaneously generate all predicted positions. The authors report that the results when only one time step at a time is predicted are worse than the results when all future times are predicted at the same time.

The CoverNet model [106] uses a CNN in combination with a trajectory set generated from the input state containing, e.g., speed, acceleration, and yaw rate. The image features pass through some fully-connected layers and produce probabilities for each mode using softmax.

The Y-net model [107] uses the U-Net architecture [108] for the semantic segmentation of the input image. It also computes a distribution for the future trajectories, where the sampled points are clustered using k-means [109].

The TraPHic model [110] uses a hybrid LSTM-CNN network. The trajectory information is passed through LSTMs to construct three maps for the horizon, neighbors, and ego vehicle. The first two are further passed through different CNNs and concatenated with the ego car tensor. The resulting latent representations are passed through another LSTM to predict the ego trajectory.

The EvolveGraph approach [111] uses graphs to model the behavior of heterogeneous agents. It proposes an observation graph, fully connected, to represent the agents in the scene, and an interaction graph for the agent–agent and agent–context interactions. It also employs an encoder–decoder technique, with the encoder using softmax for edge classification and the decoder generating a Gaussian mixture distribution for prediction.

The TNT model [112] uses VectorNet [113], a hierarchical graph neural network, to encode the context of a scene, including road lanes and the position of the traffic signs, besides the trajectories of the agents. The generated set of trajectories is finally filtered to reject similar instances.

Generative approaches are also used. For example, PRECOG [114] follows the idea of identifying high-level goals and conditioning the predictions on them. It employs CNNs and RNNs, and also a generative model where the latent variables stand for plausible behavior of the agents in a scene. Reference [115] uses a so-called "conditional flow" variational autoencoder (CF-VAE) that can handle multi-modal conditional distributions. PECNet [116] conditions the predicted trajectories on their endpoints with a conditional variational autoencoder (CVAE) and proposes the "truncation trick", i.e., truncating the sampling distribution with a smaller standard deviation for cases with a few samples to increase the diversity for multi-modal prediction.

Several trajectory prediction models employ Generative Adversarial Networks (GANs) [117]. This architecture has two components: a generator and a discriminator. Instead of training the generator model to directly match the desired data distribution, in this case the generator is trained so that it increases the error rate of the discriminator. In turn, the discriminator tries to distinguish whether a given sample belongs to the true data distribution or is generated by the generator. Both components are engaged in a competition to outsmart the other one, and from this process the generator learns to generate data that resemble the true data distribution. In the domain of trajectory prediction, Social GAN [118] uses a GAN where the generator is composed of an LSTM-based encoder, a context-pooling module, and an LSTM-based decoder. The discriminator uses LSTMs as well.
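The competition between the two components described above is commonly written as the minimax objective of the original GAN formulation [117]. The notation is introduced here only for illustration: $D$ is the discriminator, $G$ the generator, $p_{data}$ the true data distribution, and $p_z$ the noise distribution from which the generator input $z$ is drawn.

$$\min_G \max_D \; \mathbb{E}_{x \sim p_{data}}\left[\log D(x)\right] + \mathbb{E}_{z \sim p_z}\left[\log\left(1 - D(G(z))\right)\right]$$

The discriminator maximizes this value by assigning high scores to real samples and low scores to generated ones, while the generator minimizes it by producing samples that the discriminator scores as real. Trajectory prediction models such as Social GAN typically add a conditioning input (the observed past trajectories) to both components; this conditioning is not shown in the expression above.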

Other models employ GANs in conjunction with attention mechanisms. AEE-GAN [119] uses attention in order to alleviate the issues given by the complexity of a scene with many heterogeneous interacting agents. For trajectory encoding, it also uses LSTMs. A characteristic feature is the enhanced attention module containing two components: one for recurrent visual attention enforcement (RVAE) and one for social enforcement (SE). The results of the RVAE are visualized with the Grad-Cam method [120], which creates a heatmap with the attention weights of the image pixels.

Another GAN-based architecture is Social Ways [121], which uses three types of losses: discrimination loss for the discriminator, adversarial loss for the generator, and information loss for both. SoPhie [122] uses a GAN module together with a feature extractor module composed of a CNN and several LSTM encoders, and an attention module with two components: physical attention and social attention.

Attention mechanisms are also used with techniques other than GANs. MHA-JAM [123] uses a CNN for the transformation of the input image and LSTMs for trajectory encoding, whose outputs then pass to several attention heads that provide the data for the LSTM decoders.

Other authors rely on methods to handle graphs explicitly. For example, DAG-NET [124] uses an attention-enhanced graph neural network (GNN) together with a recurrent variational encoder (RVAE) composed of a variational autoencoder (VAE) and a recurrent neural network (RNN). Multiverse [125] is also based on a graph attention network that is used by a convolutional recurrent neural network (ConvRNN). Unlike other approaches, it uses an occupancy grid for a coarse-grained prediction, which is further refined by a fine-grained prediction. Graphs are also employed by Trajectron++ [126], where a scene is represented as a spatio-temporal graph in which nodes denote the agents and edges denote their interactions. A local map is processed by a CNN, trajectories are encoded with LSTMs, multi-modal solutions are handled by means of a CVAE, and the trajectory decoders are based on gated recurrent units (GRUs).

P2TIRL [127] uses similar techniques, i.e., attention, GRU, CNN, but it conditions trajectories by means of a policy learned with inverse reinforcement learning (IRL) on a grid that represents the scene.

#### *3.4. Methods Using Stochastic Techniques*

The authors of [128] use Partially Observable Markov Decision Processes (POMDPs) for behavior prediction and nonlinear receding horizon control, or model predictive control, for trajectory planning. The POMDP models the interactions between the ego vehicle and the obstacles. The action space is discretized into: acceleration, deceleration, and maintaining the current speed. For each of the obstacle vehicles, three types of intentions are considered: going straight, turning, and stopping. The reward function is chosen so that the agents make the maximum progress on the road while avoiding collisions. A particle filter is implemented to update the belief of each motion intention for each obstacle vehicle. For the ego car, the bicycle kinematic model is used to update the state.
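
The two update rules mentioned above can be illustrated with a minimal sketch. The belief update below is a discrete Bayes filter over the three intentions, which is a simplification of the particle filter used in [128]. The function names, the wheelbase, the time step, and the likelihood values are assumptions made for illustration only.

```python
import math


def bicycle_step(x, y, heading, speed, accel, steer, wheelbase=2.7, dt=0.1):
    """Advance a rear-axle kinematic bicycle model by one Euler step."""
    x += speed * math.cos(heading) * dt
    y += speed * math.sin(heading) * dt
    heading += speed / wheelbase * math.tan(steer) * dt
    speed = max(0.0, speed + accel * dt)  # the sketch does not model reversing
    return x, y, heading, speed


def update_intention_belief(belief, likelihoods):
    """Bayes rule: the posterior is proportional to prior times likelihood."""
    unnormalized = {k: belief[k] * likelihoods[k] for k in belief}
    total = sum(unnormalized.values())
    if total == 0.0:
        return dict(belief)  # observation ruled out everything; keep the prior
    return {k: v / total for k, v in unnormalized.items()}


# Uniform prior over the three intentions considered for an obstacle vehicle.
belief = {"straight": 1 / 3, "turn": 1 / 3, "stop": 1 / 3}
# Hypothetical likelihoods of the latest observation under each intention.
belief = update_intention_belief(
    belief, {"straight": 0.1, "turn": 0.2, "stop": 0.7}
)
# Ego vehicle accelerating on a straight line.
ego_state = bicycle_step(0.0, 0.0, 0.0, 10.0, accel=1.0, steer=0.0)
```

In a particle filter, the same multiplication by the observation likelihood is applied to the weight of each particle, followed by resampling.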

Article [129] presents a method to predict trajectories in dense city environments. The authors recorded the trajectories of cars comprising over 1000 h of driving in San Francisco and New York. By relating the current position of an observed car to this large dataset of previously exhibited motion in the same area, the prediction of its future position can be performed directly, under the hypothesis that the car follows the same trajectory pattern as one of the cars that passed through the same location in the past. This non-parametric method improves over time as the number of samples increases and avoids the need for more complex models.
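
The underlying idea can be sketched as a nearest-neighbor lookup. This is only an illustration under strong assumptions: the function name and data layout are hypothetical, matching is done on position alone, and a brute-force search is used where a real system would need a spatial index.

```python
import math


def predict_from_history(position, past_trajectories, horizon):
    """Return the continuation of the recorded trajectory that passed
    closest to `position`, truncated to `horizon` future points."""
    best = None  # (distance, trajectory, index of the closest point)
    for trajectory in past_trajectories:
        for i, point in enumerate(trajectory):
            distance = math.dist(position, point)
            if best is None or distance < best[0]:
                best = (distance, trajectory, i)
    if best is None:
        return []  # no recorded data for this area
    _, trajectory, i = best
    return trajectory[i + 1 : i + 1 + horizon]
```

Adding more recorded trajectories increases the chance that a close match exists, which is why such a method improves as data accumulate.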

Paper [93] presents a trajectory prediction method that combines the constant yaw rate and acceleration (CYRA) motion model with maneuver recognition. The maneuver recognition module selects the current maneuver from a predefined set (e.g., keep lane, change lane to the right or to the left, and turn at an intersection) by comparing the center lines of the road lanes to a local curvilinear model of the path of the vehicle. The proposed method combines the short-term accuracy of the former technique and the longer-term accuracy of the latter. The authors use mathematical models that take into account the position, speed, and acceleration of vehicles.
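
For illustration, the motion-model half of such a scheme can be sketched by integrating a state with constant yaw rate and constant acceleration. The function name, the Euler integration, and the step size are assumptions; the maneuver recognition part of [93] is not represented here.

```python
import math


def cyra_predict(x, y, heading, speed, accel, yaw_rate, horizon, dt=0.01):
    """Integrate a constant yaw rate and acceleration model with Euler steps
    over `horizon` seconds and return the final (x, y, heading, speed)."""
    steps = round(horizon / dt)
    for _ in range(steps):
        x += speed * math.cos(heading) * dt
        y += speed * math.sin(heading) * dt
        heading += yaw_rate * dt
        speed = max(0.0, speed + accel * dt)  # a braking vehicle stops at zero
    return x, y, heading, speed
```

Because the yaw rate and acceleration are frozen at their current values, the prediction is only trustworthy over short horizons, which is the motivation for combining it with maneuver recognition.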

In [130], a method is presented that evaluates the probabilistic prediction of real traffic scenes with varying start conditions. The prediction is based on a particle filter, which estimates the behavior-describing parameters of a microscopic traffic model, i.e., the driving style as a distribution of behavior parameters. This method seems to be applicable for long-term trajectory planning. The driving style parameters of the intelligent driving model (IDM) are continuously estimated, together with the relative motion between objects. By measuring vehicle accelerations, a driving style estimation can be provided from the first detection without the need for a long observation time before performing the prediction. By using a particle filter, it is possible to handle continuous behavior changes with arbitrarily shaped parameter distributions. Forward propagation using Monte Carlo simulation provides an approximate probability density function of the future scene.
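
To make the role of the driving style parameters concrete, the following sketch implements the IDM acceleration law in its commonly stated form. The default parameter values are illustrative assumptions, not the values estimated in [130], and clamping the dynamic part of the desired gap at zero is a common variant rather than part of the original formulation.

```python
import math


def idm_acceleration(v, gap, dv, v0=30.0, T=1.5, s0=2.0, a_max=1.0, b=1.5,
                     delta=4.0):
    """Acceleration of a following vehicle according to the IDM.

    v: own speed (m/s); gap: bumper-to-bumper distance to the leader (m);
    dv: approach rate, i.e., own speed minus leader speed (m/s).
    v0: desired speed; T: desired time headway; s0: minimum gap;
    a_max: maximum acceleration; b: comfortable deceleration.
    """
    desired_gap = s0 + max(0.0, v * T + v * dv / (2.0 * math.sqrt(a_max * b)))
    return a_max * (1.0 - (v / v0) ** delta - (desired_gap / gap) ** 2)
```

In a particle filter setting, each particle would carry its own parameter set (v0, T, s0, a_max, b) and be weighted by how well the resulting acceleration matches the measured one.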

In first-order Markov models, a state prediction depends only on the previous observed state; therefore, if the set of past trajectories has common subsequences, the quality of future predictions may be poor. An additional problem is that the data obtained from sensors can be affected by occlusions. The approaches based on Gaussian processes (GPs) overcome this problem by modeling motion patterns as velocity flow fields and provide good performance in the presence of noise. Another advantage is that the predictions have a simple analytical form, and this can be used to assess the risk in traffic scenarios.

As the traffic participants have a mutual influence on one another, their interaction is explicitly considered in [102], which is inspired by an optimization problem. For motion prediction, the collision probability of a vehicle performing a certain maneuver is computed. The prediction is performed based on the safety evaluation and the assumption that drivers avoid collisions. This combination of the intention of each driver and the driver's local risk assessment to perform a maneuver leads to an interaction-aware motion prediction. The authors compute the probability that a collision will occur anywhere in the whole scene, considering that the number of different maneuvers is limited (e.g., lane changes, acceleration, maintaining the speed, deceleration, and combinations), and then the proposed system assesses the danger of possible future trajectories.

The same concept of considering risk is used in [92], which applies a Bayesian approach combined with maneuver-based trajectory prediction. First, a collection of high-level driving maneuvers is assessed for each vehicle with inference in the Bayesian network that models the traffic scene. Then, maneuver-based probabilistic trajectory prediction models are employed to predict the configuration of each vehicle forward in time. The proposed system has three main parts: the maneuver detection, the prediction, and the criticality assessment. In the last part, the individual joint distributions are used together with a parametric free space map-based representation of the environment with probability distribution functions to estimate the probability of a collision between the ego car and any of its neighbors within the prediction horizon via Monte Carlo simulation.
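
The Monte Carlo step of such a criticality assessment can be sketched as follows. This is a deliberately simplified stand-in: instead of the maneuver-based prediction models of [92], the neighbor follows a constant-velocity model perturbed by Gaussian acceleration noise, and a collision is declared when the two vehicle centers come closer than a fixed radius. All names and default values are assumptions.

```python
import math
import random


def collision_probability(ego, other, horizon, dt=0.1, sigma_accel=1.0,
                          radius=2.0, samples=2000, seed=0):
    """Estimate the probability that `other` comes within `radius` meters of
    `ego` during `horizon` seconds. Both states are (x, y, vx, vy)."""
    rng = random.Random(seed)
    steps = round(horizon / dt)
    hits = 0
    for _ in range(samples):
        ex, ey, evx, evy = ego
        ox, oy, ovx, ovy = other
        for _ in range(steps):
            # The ego car follows its plan exactly.
            ex += evx * dt
            ey += evy * dt
            # The neighbor receives random acceleration noise at every step.
            ovx += rng.gauss(0.0, sigma_accel) * dt
            ovy += rng.gauss(0.0, sigma_accel) * dt
            ox += ovx * dt
            oy += ovy * dt
            if math.hypot(ex - ox, ey - oy) < radius:
                hits += 1
                break  # count each sampled future at most once
    return hits / samples
```

The estimate is the fraction of sampled futures that contain a collision, so its precision improves with the square root of the number of samples.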

The authors of [68] propose a framework with three interacting modules: a trajectory prediction module based on a motion-based interaction model combined with maneuver-specific variational Gaussian mixture models, a maneuver recognition module based on hidden Markov models (HMMs) for assigning confidence values for maneuvers being performed by surrounding vehicles, and a vehicle interaction module that handles the context of the scene and assigns final predictions by minimizing an energy function based on outputs of the other two modules. The paper defines ten maneuver classes as combinations of lane passes, overtakes, cut-ins, and drifts into the ego lane. A corresponding energy minimization problem is set so that the predictions where cars come too close to one another are penalized.

#### *3.5. Mixed Methods*

The authors of [131] use a model-based approach relying on vehicle kinematics and an assumption that drivers plan trajectories in such a way as to minimize an unknown cost function. They introduce an IRL algorithm to learn the cost functions of other vehicles in an energy-based generative model. Langevin sampling, a Monte Carlo-based sampling algorithm, is used to directly sample the control sequence. Langevin sampling is shown to generate better predictions with higher stability. It seems that this algorithm is more flexible than standard IRL methods, and can learn higher-level, non-Markovian cost functions defined over entire trajectories. The cost functions are extended with neural networks in order to combine the advantages of both model-based and model-free learning. The study uses both environment structure, in the form of kinematic vehicular constraints, which can be modeled very accurately, and the assumption that human drivers optimize their trajectories according to a subjective cost function.
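
As a point of reference, the basic Langevin update can be sketched in one dimension. The sketch below is the generic unadjusted Langevin algorithm applied to a scalar variable with a user-supplied energy gradient; it is not the sampler of [131], which operates on entire control sequences with a learned cost. The function name, step size, and number of steps are assumptions.

```python
import math
import random


def langevin_samples(grad_energy, x0, step=0.05, n_steps=20000, seed=0):
    """Draw approximate samples from a density proportional to exp(-E(x)).

    Each iteration moves against the energy gradient and adds Gaussian noise:
    x <- x - step * dE/dx + sqrt(2 * step) * noise.
    """
    rng = random.Random(seed)
    x = x0
    samples = []
    for _ in range(n_steps):
        noise = rng.gauss(0.0, 1.0)
        x = x - step * grad_energy(x) + math.sqrt(2.0 * step) * noise
        samples.append(x)
    return samples
```

For the quadratic energy E(x) = x²/2, whose gradient is x, the samples approximately follow a standard normal distribution after an initial burn-in phase, with a small bias that shrinks as the step size decreases.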

Multiple deep neural network architectures are designed to learn the cost functions, some of which augment a set of hand-crafted features. The human-crafted cost functions are defined as ten components: the distance to the goal, the distance to the center of the lane, the penalty of collision to other vehicles (inversely proportional to the distance to other vehicles), the L2-norm of acceleration and steering, the L2-norm for the difference of acceleration and steering between two frames, the heading angle to lane, and the difference to the speed limit.

The application of deep learning and mixture models for the prediction of human drivers in traffic is investigated in [132]. The chosen approach is a mixture density network (MDN) where the neural model has LSTM units and the mixture model consists of univariate Gaussian distributions. It applies multi-task learning: by sharing the representation between multiple tasks, the model is able to generalize better. A limitation is that the tasks usually have to be related to some extent. For example, a single neural network can predict both longitudinal and lateral accelerations from the same input, where the first few layers in the network are shared between the two tasks, and then separated into two different layers to produce the final outputs. To capture the intention of the driver, another layer is used in parallel to the motion prediction layer after the LSTM layers. This layer indicates if the driver intends to switch lanes and remain there within the next four seconds.
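
An MDN is typically trained by minimizing the negative log-likelihood of the observed target under the predicted mixture. The following sketch computes this quantity for a univariate Gaussian mixture; the function name is hypothetical, and the mixture parameters are passed in directly, whereas in an MDN they would be produced by the output layer of the network.

```python
import math


def gaussian_mixture_nll(target, weights, means, stds):
    """Negative log-likelihood of `target` under a univariate Gaussian mixture.

    `weights` must be positive and sum to one; `stds` must be positive.
    """
    log_terms = []
    for w, mu, sd in zip(weights, means, stds):
        log_pdf = (-0.5 * ((target - mu) / sd) ** 2
                   - math.log(sd) - 0.5 * math.log(2.0 * math.pi))
        log_terms.append(math.log(w) + log_pdf)
    # Log-sum-exp trick: subtract the maximum to avoid numerical underflow.
    m = max(log_terms)
    return -(m + math.log(sum(math.exp(t - m) for t in log_terms)))
```

With several components, the loss stays low as long as at least one component explains the target well, which is what allows the model to represent alternative outcomes such as keeping or changing the lane.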

Another algorithm is the Predictron [133]. This architecture is an abstract model based on a Markov reward process, which can be rolled forward for a series of "imagined" planning steps. The predictron is trained end-to-end with the objective that the accumulated values computed in each forward pass should approximate the true value function. It is reported to demonstrate more accurate predictions than conventional deep neural network architectures.

The Monte Carlo Tree Search (MCTS) [134] algorithm can also be used in the context of trajectory planning. It simulates the possible future trajectories starting from the current state, then it evaluates the performance of the leaves using an evaluation function, e.g., a "value network", and finally it uses these evaluations to update the internal values along the trajectory. The architecture presented in [135], called MCTSnet, incorporates the simulation-based search into a neural network, working with vector embeddings. Its advantage is that gradient-based optimization can be used to train the network end-to-end. However, internal action sequences directing the control flow of the network cannot be differentiated. To address this, an approximate method for credit assignment is proposed that allows this part of the search network to be learned from data.
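
The three phases described above (simulation from the current state, leaf evaluation, and value backup) can be sketched with a plain MCTS that uses the UCT selection rule. This is not MCTSnet: there are no embeddings or learned components, and the evaluation function is supplied by the caller in place of a value network. The class and function names, the exploration constant, and the deterministic transition function are assumptions.

```python
import math


class Node:
    def __init__(self, state, parent=None):
        self.state = state
        self.parent = parent
        self.children = {}      # action -> Node
        self.visits = 0
        self.total_value = 0.0


def uct_child(node, c=1.4):
    """Select the child with the best mean value plus exploration bonus."""
    def score(child):
        mean = child.total_value / child.visits
        return mean + c * math.sqrt(math.log(node.visits) / child.visits)
    return max(node.children.values(), key=score)


def mcts(root_state, actions, step, evaluate, depth, iterations=200):
    """Return the root action with the most visits.

    step(state, action) gives the next state; evaluate(state) scores a leaf.
    `actions` must be a non-empty sequence.
    """
    root = Node(root_state)
    for _ in range(iterations):
        node, d = root, 0
        # Selection: descend while every action of the node has been tried.
        while d < depth and len(node.children) == len(actions):
            node = uct_child(node)
            d += 1
        # Expansion: add one untried action unless the depth limit is reached.
        if d < depth:
            action = next(a for a in actions if a not in node.children)
            child = Node(step(node.state, action), parent=node)
            node.children[action] = child
            node = child
        # Evaluation of the leaf, then backup along the path to the root.
        value = evaluate(node.state)
        while node is not None:
            node.visits += 1
            node.total_value += value
            node = node.parent
    return max(root.children, key=lambda a: root.children[a].visits)
```

As a toy usage, a state can be a speed level, the actions can be decelerate, maintain, and accelerate (-1, 0, +1), and the evaluation can be the negative distance to a target speed, in which case the search favors accelerating when the target lies above the current speed.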

#### *3.6. Discussion*

We summarize the works and their specific techniques presented in the trajectory prediction section in Table 2. The table is sorted by publication year in order to give the reader an impression about the overall progress in this field.

The datasets that were used as benchmarks by the papers were also included. We must mention that all authors report experimental results on some kind of datasets, e.g., driving data specifically collected in some areas of the world or synthetic data collected from simulators. However, only the publicly available, real-world datasets were included in the table.

Information about the general capabilities of the methods was included in terms of the ability to provide multi-modal predictions and whether the social context, i.e., the other agents in the scene, was taken into account. Here, we only mention the approaches that handle the interactions and the context explicitly, e.g., with some kind of pooling mechanism or graph representation, not those that just consider an image as the input, which implicitly contains graphical depictions of all agents.

Some of the works also predict the trajectories of pedestrians, not only vehicles. However, we do not distinguish between these case studies, but only mention the main methods which can be used in both situations.


#### **Table 2.** Overview of trajectory prediction solutions.


In general, many authors use CNNs to process the graphical inputs, e.g., camera-based images or maps, and LSTMs for trajectory encoding and decoding. Because of the constraints of real-time requirements, some works also use CNN architectures for prediction [97]. They seem to be able to model complex relations and capture spatial correlations in the data [136]. Some papers state that they are also competitive in modeling temporal data [95], with performance comparable to that of the LSTMs, but with a much simpler internal structure. Multi-modal predictions are often made with some kind of generative models such as CVAE. The methods based on CNNs seem to be more lightweight and faster than those containing LSTM and CVAE components. Still, a large number of approaches combine these techniques in some way.

Other works employ more recent techniques such as GANs and graph representations in conjunction with neural networks. Attention mechanisms also seem promising to distinguish the important features in the context of a complex scene with many interacting agents.

The data themselves may cause difficulties, because a network only learns what is present in the data, and hopefully generalizes well, but there may be situations where the humans do not behave according to previous observations. This is one drawback of using neural networks. However, it seems that the advantages of using data-driven approaches outweigh the disadvantages.

Many methods that belong to the stochastic paradigm try to estimate the probabilities of discrete maneuvers. When using, e.g., hidden Markov models, the movement of the traffic participants is evaluated independently, an assumption which is true only for simple scenarios. Gaussian Process regression can quantify uncertainty, but it is also limited in its ability to model complex interactions. For this purpose, other techniques such as Bayesian networks can be used instead, with the disadvantage of an increased computation time and thus a difficulty in handling real-time learning tasks [97].

Although it is possible to do multi-step prediction with a Kalman filter, it cannot be extended far into the future with reasonable accuracy. A multi-step prediction done solely by a Kalman filter was found to be accurate up until 10–15 time steps, after which the predictions diverged and ended up being worse than constant velocity inference [132]. This emphasizes the advantages of data-driven approaches, as it is possible to observe almost an infinite number of variables that may all affect the driver, whereas the Kalman filter relies solely on the physical movement of the vehicle.
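
The mechanism behind this degradation can be illustrated by running the prediction step of a Kalman filter repeatedly without any measurement update. The sketch below assumes a one-dimensional constant-velocity model with a discrete white-noise acceleration term; the function name and the noise value are assumptions, and the code does not reproduce the experiment of [132].

```python
def predict_only(p, v, P, steps, dt=0.1, q=0.5):
    """Apply the Kalman prediction step `steps` times with no measurements.

    State: position p and velocity v. P is the 2x2 covariance
    [[Ppp, Ppv], [Ppv, Pvv]]; q is the acceleration noise variance.
    """
    (ppp, ppv), (_, pvv) = P
    for _ in range(steps):
        p += v * dt  # the velocity itself is assumed constant
        # P <- F P F^T + Q, with F = [[1, dt], [0, 1]] and
        # Q = q * [[dt^4 / 4, dt^3 / 2], [dt^3 / 2, dt^2]].
        # Each line uses the not-yet-updated values of the later entries.
        ppp = ppp + 2.0 * dt * ppv + dt * dt * pvv + q * dt ** 4 / 4.0
        ppv = ppv + dt * pvv + q * dt ** 3 / 2.0
        pvv = pvv + q * dt ** 2
    return p, v, [[ppp, ppv], [ppv, pvv]]
```

The position variance grows with every step because the velocity uncertainty is integrated and process noise is added each time, while the predicted mean simply extrapolates the last estimated velocity.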

Another approach is to learn policies in a supervised way, e.g., imitation learning. The cost function of a human driver can be estimated with inverse reinforcement learning and then a policy can be extracted from the cost function [136]. However, this may again be inefficient for real-time applications [97].

In multi-agent contexts such as those defined by traffic scenarios, since an agent's actions depend on the other agents' actions, uncertainty can propagate to future states with the consequence that an agent completely stops because all possible actions are deemed as unacceptably unsafe. This is known as the "freezing-robot" problem. Deadlock avoidance and multi-objective decision making are very common in practice, e.g., in autonomous robotics [137–139].

Finally, it should be mentioned that in this section, we have addressed the trajectory prediction problem. A related but distinct problem is trajectory planning, i.e., finding an optimal path from the current location to a given goal location. Its aim is to produce smooth trajectories with small changes in curvature, so as to minimize both the lateral and the longitudinal acceleration of the ego vehicle. For this purpose, there are several methods reported in the literature, e.g., using cubic spline interpolation, trigonometric spline interpolation, Bézier curves, or clothoids, i.e., curves with a complex mathematical definition, which have a linear relation between the curvature and the arc length, and allow smooth transitions from a straight line to a circle arc or vice versa. Deep reinforcement learning methods [140,141] such as policy gradients [142], deep Q-network [143], actor-critic [144], asynchronous advantage actor-critic [145], proximal policy optimization [146], trust region policy optimization [147], imagination-augmented agents [148], or proximal gradient temporal difference learning [149] can also be used to decide the possible maneuvers that the ego car can make in order to optimize criteria related to risk and efficiency.
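
The defining property of a clothoid, i.e., a curvature that varies linearly with the arc length, can be illustrated by numerical integration. The sketch below samples points on such a curve with the midpoint rule; the function name, the parameter names, and the choice of starting at the origin with a heading along the x axis are assumptions.

```python
import math


def clothoid_points(length, kappa0, sharpness, n=1000):
    """Sample a curve whose curvature is kappa(s) = kappa0 + sharpness * s.

    The heading is the integral of the curvature, and the position is the
    integral of the unit tangent, both evaluated with the midpoint rule.
    """
    ds = length / n
    x = y = 0.0
    points = [(x, y)]
    for i in range(n):
        s_mid = (i + 0.5) * ds
        heading = kappa0 * s_mid + 0.5 * sharpness * s_mid ** 2
        x += math.cos(heading) * ds
        y += math.sin(heading) * ds
        points.append((x, y))
    return points
```

Setting the sharpness to zero gives the two limiting cases joined by a clothoid: a straight line for zero initial curvature and a circle arc for a constant nonzero curvature.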

#### **4. Conclusions**

Learning-based approaches have basically become the norm for autonomous driving problems. Although explicit rule-based methods may have an important advantage in the form of explicit knowledge, hand-crafted rules usually take a considerable amount of effort to devise and validate, and usually do not have satisfactory generalization capabilities because of the great variability of situations that may appear in a driving context. Unfortunately, techniques based on learning typically require large quantities of data in order to cover a sufficiently large part of the space of possible driving behaviors.

Because they capture the generative structure of vehicle trajectories, model-based methods can potentially learn more from fewer data than model-free methods. However, good cost functions are challenging to learn, and simple, hand-crafted representations may not generalize well across tasks and contexts. In general, model-based methods can be less flexible and may underperform model-free methods in the limit of infinite data. Model-free methods take a data-driven approach, aiming to learn predictive distributions over trajectories directly from data. These approaches are more flexible and require less knowledge engineering in terms of the type of vehicles, maneuvers, and scenarios, but the amount of data they require may be very large.

The past three decades have seen increasingly rapid progress in driverless vehicle technology. In addition to the advances in computing and perception hardware, this rapid progress has been enabled by major theoretical progress in computational aspects. Autonomous cars are complex systems that can be decomposed into a hierarchy of decision making problems, where the solution of one problem is the input to the next. The breakdown into individual decision making problems has enabled the use of well-developed methods and technologies from a variety of research areas.

This literature review has concentrated only on two aspects: tracking and trajectory prediction. It can serve as a reference for assessing the computational tradeoffs between various choices for algorithm design.

**Author Contributions:** Writing—Sections 1 and 2, M.G., Sections 3 and 4, F.L.; funding acquisition, F.L. All authors have read and agreed to the published version of the manuscript.

**Funding:** This research was funded by Continental AG within the *Proreta 5* project.

**Institutional Review Board Statement:** Not applicable.

**Informed Consent Statement:** Not applicable.

**Data Availability Statement:** Not applicable.

**Acknowledgments:** We kindly thank Continental AG for their great cooperation within *Proreta 5*, which is a joint research project of the Technical University of Darmstadt, University of Bremen, "Gheorghe Asachi" Technical University of Iași and Continental AG.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **References**

