#### 4.2.1. 1980s–2015

A very early model following the disruptive approach to end-to-end ego path planning was proposed by Pomerleau in 1989 [85]. This neural-network-based model received much attention for its novel approach: camera and laser data served as input, and the output was the road curvature to be followed for lane centering. The network contained only one hidden layer and completed training in just half an hour. Due to the simplicity of the framework, the laser data was found to play only a minor role. Pomerleau recognized the importance of this study with the concluding remark: "We certainly believe it is important to begin researching and evaluating neural networks in real world situations, and we think autonomous navigation is an interesting application for such an approach" [85]. Further work picked up on this approach, applying fully connected neural networks, but reduced the problem to lane change decisions [41,86]. These models were mainly trained on simulated data.

It was recognized that the motion of an object follows characteristic patterns. Therefore, unsupervised clustering techniques were applied to classify these trajectory types. The advantage of this technique is that it allows for long-term prediction of the trajectory, provided that the correct maneuver class has been identified.

Vasquez and Fraichard [87] followed this approach by applying a pairwise clustering algorithm based on a dissimilarity measure to simulated and real data of pedestrian motion patterns. After calculating the mean value and standard deviation of each cluster, the likelihood that an observed trajectory belongs to a cluster was estimated. The observed problems are that early predictions are associated with a high error and that individual aspects of the trajectories could not be taken into account. Li et al. [88] used a multi-stage clustering approach: the algorithms were optimized by a refinement strategy, first creating general clusters which are then processed by a second clustering algorithm.

Hermes et al. [59] make use of radial basis functions, which serve as an a priori probability for trajectory prediction. Among the highlighted references, this study was found to be the first with a quantitative evaluation of experimental results on trajectory prediction: 24 test trajectories from real drives were evaluated with respect to the RMSE. Although their model did not outperform the standard model on straight trajectories, it showed a strong improvement on curves.

HMMs are also used for scene prediction. The input is a graph whose nodes contain the object states and whose edges represent transitions between the states. The initial state is a stochastic representation of the states; the transition probabilities are then evaluated iteratively in discrete time steps. "Hidden" refers to the fact that the object states are not directly observable. The process consists of two steps: determination of the structure (number of nodes and edges) and of the parameters (state prior, transition and observation probabilities), which are learned. Clustering techniques are often used to determine the structure of the HMM. Vasquez et al. [89] use a growing HMM, meaning that their model can change its structure. The data was recorded in a parking lot, resulting in a model error on the order of meters.
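To illustrate the evaluation step of such maneuver-class models, the following minimal NumPy sketch (our own illustration, not the implementation of [87–89]) scores an observed trajectory against one HMM with Gaussian emissions via the forward algorithm; the class whose HMM yields the highest log-likelihood is selected. All parameter names and shapes are assumptions.

```python
import numpy as np

def hmm_log_likelihood(obs, pi, A, means, stds):
    """Forward algorithm: log-likelihood of an observed trajectory
    under one maneuver-class HMM with Gaussian emissions.

    obs:   (T, d) observed states, e.g., positions
    pi:    (N,)   state prior
    A:     (N, N) transition probabilities, A[i, j] = P(j | i)
    means: (N, d) per-state emission means
    stds:  (N, d) per-state emission standard deviations
    """
    d = obs.shape[1]

    def emission(x):
        # independent Gaussian likelihood of x under each hidden state
        z = (x - means) / stds                                # (N, d)
        norm = np.prod(stds, axis=1) * (2 * np.pi) ** (d / 2)
        return np.exp(-0.5 * (z ** 2).sum(axis=1)) / norm     # (N,)

    alpha = pi * emission(obs[0])                             # forward variable
    log_l = np.log(alpha.sum())
    alpha = alpha / alpha.sum()                               # rescale against underflow
    for x in obs[1:]:
        alpha = (alpha @ A) * emission(x)
        s = alpha.sum()
        log_l += np.log(s)
        alpha = alpha / s
    return log_l

# classification: pick the maneuver-class HMM with the highest score, e.g.,
# best_class = max(class_params, key=lambda p: hmm_log_likelihood(track, *p))
```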

In 2013, a dynamic approach by Houenou et al. [60] received considerable attention. Their Constant Yaw Rate and Acceleration (CYRA) model selects the trajectory that minimizes a cost function depending on acceleration and maneuver duration. The model achieves a very low mean displacement error, but it was evaluated only on their own collected dataset and is thus not comparable to other models.
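The rollout underlying a constant yaw rate and acceleration model can be sketched as a simple Euler integration; the state layout and step size below are our assumptions, and the cost-based trajectory selection of [60] is not reproduced here.

```python
import numpy as np

def propagate_cyra(x, y, heading, v, yaw_rate, accel, horizon, dt=0.1):
    """Roll out a trajectory under constant yaw rate and acceleration.

    The CYRA model of Houenou et al. [60] additionally selects among such
    rollouts via a cost function on acceleration and maneuver duration.
    """
    traj = []
    for _ in range(int(horizon / dt)):
        x += v * np.cos(heading) * dt
        y += v * np.sin(heading) * dt
        heading += yaw_rate * dt
        v = max(v + accel * dt, 0.0)   # no reversing in this simple sketch
        traj.append((x, y))
    return np.array(traj)

# e.g., 5 s prediction for a vehicle at 20 m/s in a gentle left curve
trajectory = propagate_cyra(0.0, 0.0, 0.0, 20.0,
                            yaw_rate=0.05, accel=0.5, horizon=5.0)
```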

#### 4.2.2. 2016: The Rise of Deep Learning Techniques

In 2016 and the following years, a shift in techniques is observable. The number of papers using neural networks increased strongly, and a more systematic evaluation of object positions became common.

The objects' states can be represented as sequences of discrete historical data spanning usually a few seconds, e.g., $\mathbf{X} = (x_{-\Delta T}, x_{-\Delta T + 1}, \ldots, x_{-1}, x_0)$, where $x_i$ stands for the position, velocity, or any other physical quantity at a given time step $i$ within the observed time frame $\Delta T$. The goal of the model is to predict the state variables beyond the end of the observation period; at inference, this prediction horizon should lie some seconds in the future. Time series can be described well with simple RNN or LSTM networks. Therefore, it is not surprising that many authors jumped on this trend after the first models appeared in the community. The first LSTM models were still used for classification of maneuvers [54,90], but prediction horizons of up to 10 s suddenly seemed within reach [65]. The simplest LSTM models lack a description of the mutual interactions of the traffic participants, which was then approached by representing the objects on an occupancy grid [91], as visualized in Figure 10 (right).
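As an illustration of this sequence-to-prediction setup, the following minimal PyTorch sketch encodes the observed history with an LSTM and regresses future waypoints; the state dimension, sampling rate, and horizon are assumptions for the example, not values from the cited works.

```python
import torch
import torch.nn as nn

class TrajectoryLSTM(nn.Module):
    """Minimal single-agent predictor: encode the observed history
    X = (x_{-dT}, ..., x_0) with an LSTM and regress future positions."""

    def __init__(self, state_dim=4, hidden_dim=64, horizon=30):
        super().__init__()
        self.encoder = nn.LSTM(state_dim, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, horizon * 2)  # (x, y) per future step
        self.horizon = horizon

    def forward(self, history):                 # history: (B, T_obs, state_dim)
        _, (h, _) = self.encoder(history)       # h: (1, B, hidden_dim)
        out = self.head(h.squeeze(0))           # (B, horizon * 2)
        return out.view(-1, self.horizon, 2)    # (B, horizon, 2) future waypoints

# e.g., 3 s of history at 10 Hz (T_obs = 30), predicting 3 s ahead
model = TrajectoryLSTM()
future = model(torch.randn(8, 30, 4))           # -> (8, 30, 2)
```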

This drawback was further addressed in the following, more complex models. By introducing a concatenation layer, in which the information from the individual LSTMs is brought together, context information could be modeled. The term "social pooling" became popular [66,92], describing the interdependencies of all agents in the scene; it was originally developed for human trajectory prediction [93–95]. Furthermore, multi-modality came into focus, in which several possible outcomes of the same initial context are modeled. Both aspects became standard ingredients of most of the following models [62,64,96,97].
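A minimal sketch of the pooling step, in the spirit of the social pooling layer of [93], is given below; the grid size, cell size, and tensor shapes are our assumptions.

```python
import torch

def social_pool(positions, hidden, grid=4, cell=2.0):
    """Social pooling sketch: for each agent, the hidden states of its
    neighbors are accumulated into a spatial grid centered on the agent.

    positions: (N, 2) agent coordinates in meters
    hidden:    (N, H) per-agent LSTM hidden states
    returns:   (N, grid * grid * H) pooled social tensor, ready to be
               concatenated with the agent's own features
    """
    n, h_dim = hidden.shape
    pooled = torch.zeros(n, grid, grid, h_dim)
    half = grid * cell / 2.0
    for i in range(n):
        rel = positions - positions[i]            # neighbors relative to agent i
        for j in range(n):
            if i == j:
                continue
            gx = int(((rel[j, 0] + half) / cell).floor())
            gy = int(((rel[j, 1] + half) / cell).floor())
            if 0 <= gx < grid and 0 <= gy < grid:  # neighbor inside the grid
                pooled[i, gx, gy] += hidden[j]
    return pooled.flatten(1)
```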

The DESIRE framework [72] models the trajectories of each agent with RNN encoders, which are then concatenated and combined with the scene context from a standard convolutional layer. The authors found that a 2 s history is enough to predict 4 s ahead. It is one of the first papers to propose an RNN-based encoder–decoder structure. The RNN encoder captures the motion pattern of the observation period and determines a latent representation of the temporal data. This latent representation can then easily be concatenated with other data, e.g., feature maps resulting from a CNN that processes an image representation of the static scene context. Finally, either the original latent representation of the RNN encoder or the concatenated tensor is used as input to the RNN-based decoder, which predicts the future waypoints of vehicles in a recurrent and auto-regressive fashion. If the latent representation from the encoder is extended by additional information, the decoder can condition its prediction not only on the previous motion of a certain vehicle but also on this further information, which enables much more accurate predictions.
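The following sketch condenses this encoder–decoder pattern: a recurrent encoder produces a latent representation, which is fused with a context vector (e.g., CNN map features) and decoded auto-regressively. It illustrates the general structure only, under assumed dimensions, and is not the DESIRE implementation.

```python
import torch
import torch.nn as nn

class Seq2SeqPredictor(nn.Module):
    """Encoder-decoder sketch in the spirit of DESIRE [72]: an RNN encoder
    compresses the observed motion into a latent vector, which is
    concatenated with scene context and decoded auto-regressively."""

    def __init__(self, state_dim=2, hidden=64, ctx_dim=32):
        super().__init__()
        self.encoder = nn.GRU(state_dim, hidden, batch_first=True)
        self.fuse = nn.Linear(hidden + ctx_dim, hidden)
        self.decoder = nn.GRUCell(state_dim, hidden)
        self.out = nn.Linear(hidden, state_dim)

    def forward(self, history, context, horizon=30):
        _, h = self.encoder(history)            # latent motion code: (1, B, hidden)
        h = torch.tanh(self.fuse(torch.cat([h.squeeze(0), context], dim=-1)))
        x = history[:, -1]                      # start from the last observed point
        preds = []
        for _ in range(horizon):                # auto-regressive rollout
            h = self.decoder(x, h)
            x = x + self.out(h)                 # predict a displacement per step
            preds.append(x)
        return torch.stack(preds, dim=1)        # (B, horizon, 2)
```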

Another feature that has become very popular in trajectory prediction during the last few years is the Conditional Variational Autoencoder (CVAE) framework [33,72]. The basic idea of using this kind of generative model in trajectory prediction is to capture uncertainty, ambiguity, and multi-modality. Due to the lack of knowledge about the intentions and inner states of the drivers controlling the surrounding vehicles, the future trajectories of those vehicles can only be estimated, not precalculated exactly. In many situations, given a specific motion of a vehicle during the observation period, different future evolutions of that trajectory are possible and in accordance with the behavior of other traffic participants and the traffic rules in general. Such situations require more than a single prediction to better capture the entire probability distribution over the future trajectory. Generative models are therefore able to produce multiple plausible trajectory predictions.
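At inference time, the multi-modality of a CVAE-style predictor comes from drawing several latent samples and decoding each one conditioned on the encoded history. The sketch below shows only this sampling path (the training-time encoder and KL term are omitted); all dimensions and layer sizes are assumptions.

```python
import torch
import torch.nn as nn

class CVAEDecoder(nn.Module):
    """At inference, draw several latent samples z ~ N(0, I) and decode
    each one, conditioned on the encoded history, into a distinct
    trajectory hypothesis."""

    def __init__(self, cond_dim=64, z_dim=16, horizon=30):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(cond_dim + z_dim, 128), nn.ReLU(),
            nn.Linear(128, horizon * 2),
        )
        self.z_dim, self.horizon = z_dim, horizon

    def sample(self, cond, k=6):
        """cond: (B, cond_dim) encoded observation; returns (B, k, horizon, 2)."""
        b = cond.size(0)
        z = torch.randn(b, k, self.z_dim)            # one latent sample per mode
        cond = cond.unsqueeze(1).expand(-1, k, -1)
        traj = self.net(torch.cat([cond, z], dim=-1))
        return traj.view(b, k, self.horizon, 2)
```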

Such an approach was followed by Xu et al. [98], who worked on large-scale crowdsourced video data, using an LSTM temporal encoder fused with a fully convolutional visual encoder and semantic segmentation as a side task. Further works combining an LSTM encoder–decoder structure with semantic information followed, using the semantic information in the form of a maneuver classification as input for trajectory modeling [99] or combining LSTM and CNN layers [67,68]. Park et al. combined their encoder–decoder model with a beam search algorithm to keep the *k* best models as trajectory candidates [63].

Time series can also be forecasted well with CNNs. The input to such a network is similar to that of LSTMs: a sequence of state representations. The historical context information is filtered with a 1-dimensional temporal CNN kernel, whose size determines the length of the time window. Some authors favor this approach [100], since CNNs are easier to train than LSTMs. A pure CNN model was realized by Luo et al. [101], who used a 4D tensor as input where one dimension is reserved for the temporal information and the three remaining dimensions contain spatial information. 3D point clouds measured by the sensors were processed in a tracking model and assigned to an occupancy grid. The output of their model is a detection of bounding boxes for *n* timesteps. The authors claim that their model is more robust to occlusion.
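A minimal sketch of such a temporal CNN encoder is shown below, with 1D convolutions sliding along the time axis; the layer sizes are illustrative only.

```python
import torch
import torch.nn as nn

class TemporalCNN(nn.Module):
    """1D temporal convolution over the history: channels hold the state
    variables, the kernel slides along the time axis, and its size sets
    the local time window summarized by each feature."""

    def __init__(self, state_dim=4, channels=64, kernel=3):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(state_dim, channels, kernel, padding=kernel // 2), nn.ReLU(),
            nn.Conv1d(channels, channels, kernel, padding=kernel // 2), nn.ReLU(),
        )

    def forward(self, history):              # history: (B, T_obs, state_dim)
        x = history.transpose(1, 2)          # Conv1d expects (B, channels, time)
        return self.net(x).mean(dim=2)       # (B, channels) temporal embedding
```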

To summarize, at this stage models were aiming at better imitating human behavior.

#### 4.2.3. GNNs, Attention and New Use Cases

A new trend in the representation of the acting agents for neural network models was the use of graph models [102–109]. Describing the connections between objects in a graph structure, as opposed to an occupancy grid, is a huge advantage if there are only sparse connections between the objects. The input to such a neural network is a graph as visualized in Figure 9. The nodes represent the objects, with feature vectors holding, e.g., positions, velocities, etc., and the edges are the connections between the objects. This naturally reflects interactions between vehicles and the interdependencies in their motion, and allows for efficient implementation as well as calculation. The information aggregation step (similar to message passing in graphical models) in such GNN models can also be considered an approximation of the negotiation processes that result from human drivers sharing the available space or even the same lane on the street. The input is usually turned into an embedding which then serves for further evaluation with RNN- or CNN-based models. An unusual approach was taken by Yu et al. [110], who processed the time information with a spectral graph convolution. Later models usually work on the time data itself, e.g., [111]. Since 2019, the models have become very complex, with many combinations of different building blocks.
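The following sketch shows one such information-aggregation step on an agent graph; the message and update functions are generic choices for illustration, not a specific model from the cited works.

```python
import torch
import torch.nn as nn

class MessagePassing(nn.Module):
    """One information-aggregation step on the interaction graph: every
    node (agent) receives messages from its neighbors and updates its
    feature vector. Sparse edges make this cheap compared to a dense grid."""

    def __init__(self, feat_dim=32):
        super().__init__()
        self.msg = nn.Linear(2 * feat_dim, feat_dim)   # message from (sender, receiver)
        self.update = nn.GRUCell(feat_dim, feat_dim)

    def forward(self, nodes, edges):
        """nodes: (N, F) agent features; edges: (E, 2) long tensor of (src, dst)."""
        src, dst = edges[:, 0], edges[:, 1]
        messages = torch.relu(self.msg(torch.cat([nodes[src], nodes[dst]], dim=-1)))
        agg = torch.zeros_like(nodes)
        agg.index_add_(0, dst, messages)               # sum incoming messages per node
        return self.update(agg, nodes)                 # updated node states
```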

In 2019, attention models were introduced in the field of trajectory prediction [112–117]. In [113,118,119], the attention module is used to avoid pre-defining an exact graph structure of a given traffic situation. Instead, all traffic participants are considered, and the attention module determines the attention weights that correspond to the degree to which one vehicle determines the motion of another, while predicting the vehicles' trajectories at the same time. Since the attention weights are usually calculated from the hidden states of an RNN-based encoder, attention is able to focus not only on the spatial dimension, by means of the positions of the vehicles, but also on the temporal dimension, because the attention weights are calculated for every single time step. It therefore learns to give more importance to the relevant latent features and thus pushes the performance of the interaction-aware models.
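A minimal sketch of this weighting mechanism, written here as generic scaled dot-product attention over agent hidden states, is given below; the cited models embed it into much larger architectures.

```python
import torch

def agent_attention(query, keys, values):
    """Scaled dot-product attention between one target agent and its
    surroundings: the attention weights quantify how strongly each
    neighbor influences the target's predicted motion.

    query:  (B, H)    hidden state of the target agent
    keys:   (B, N, H) hidden states of surrounding agents (or time steps)
    values: (B, N, H) tensors aggregated by the attention weights
    """
    scale = keys.size(-1) ** 0.5
    scores = torch.einsum('bh,bnh->bn', query, keys) / scale
    weights = torch.softmax(scores, dim=-1)          # interaction strengths, sum to 1
    context = torch.einsum('bn,bnh->bh', weights, values)
    return context, weights
```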

Later, graph convolutions were used as input to the attention module, with attention given to both spatial and temporal data [69,111]. Tang et al. [69] stress the importance of modeling multi-modality in closed form, as opposed to Monte Carlo models, which are not capable of this. They claim that the key achievement of their contribution is a model that scales to any number of agents in a scene.

Additionally, the first transformer models have entered the field of scene prediction, currently with a focus on human trajectory prediction [120,121] or long-term traffic flow (time intervals of several minutes) [122]. As in Natural Language Processing (NLP), transformers are beginning to challenge RNN-based sequence-to-sequence models in (human) trajectory prediction too, demonstrating the importance of the attention module in these tasks. Different from NLP, where the order of words in a sentence may change without modifying its meaning, the order of trajectory points cannot be changed without entirely destroying its plausibility. However, the sequential processing of the time series input data is lost when transformers are used for trajectory prediction, in contrast to RNN-based models. This could be one reason why transformers are still relatively rare in the landscape of trajectory prediction models, while RNNs continue to be used as an integral part of most approaches. At the same time, this lack of sequential processing provides the benefit of reduced computation time due to the possibility of parallelization.
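For illustration, the following sketch applies a standard transformer encoder to a trajectory: all time steps are processed in parallel, and the ordering of the points is re-injected through a sinusoidal positional encoding. This is a generic textbook setup under assumed dimensions, not a specific model from [120–122].

```python
import math
import torch
import torch.nn as nn

class TrajectoryTransformer(nn.Module):
    """Transformer encoder over a trajectory: time steps are processed in
    parallel, so point ordering must be restored via positional encoding."""

    def __init__(self, state_dim=2, d_model=64, nhead=4, layers=2, max_len=100):
        super().__init__()
        self.embed = nn.Linear(state_dim, d_model)
        # standard sinusoidal positional encoding
        pos = torch.arange(max_len).unsqueeze(1)
        div = torch.exp(torch.arange(0, d_model, 2) * (-math.log(10000.0) / d_model))
        pe = torch.zeros(max_len, d_model)
        pe[:, 0::2] = torch.sin(pos * div)
        pe[:, 1::2] = torch.cos(pos * div)
        self.register_buffer('pe', pe)
        layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, layers)

    def forward(self, traj):                       # traj: (B, T, state_dim)
        x = self.embed(traj) + self.pe[: traj.size(1)]
        return self.encoder(x)                     # (B, T, d_model)
```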

Multi-modality was also brought more into focus by Chai et al. [74]. The authors used a 3D fully convolutional network in space and time, combined with static information for all agents, to generate a feature map. In a second stage, this information is evaluated in an agent-centric network, outputting trajectory anchors per agent. The output takes the form of a bivariate Gaussian model, which gives different weights to the samples to better approximate the distribution. A similar approach for the urban use case was followed by Phan-Minh et al., where map information was used as additional input [80].

A shift towards urban use cases was observable [76,78,82,123,124]; the difficulties are the variety of traffic participants and the shorter time scales. A work that especially focuses on heterogeneous traffic is the 4D LSTM approach by Ma et al. [125]. Casas et al. [79] use 17 binary channels as input to a CNN for encoding different semantics, combining this information with a graphical representation of the agents' states and map information. Further semantics were included in the paper by Chandra et al. [71] by predicting whether surrounding vehicles are overspeeding or underspeeding. This is used as additional input to the network, together with regularization methods based on spectral clustering. Li et al. [83] use a graph double attention network with a kinematic constraint layer on top for assessing the physical feasibility of predicted maneuvers. Model robustness against missing sensor input was tested by Mohta et al. [84] and found to perform well.

Recent advances try to integrate the planning step into the traffic prediction. In a standard pipeline, the motion forecasts from scene prediction are produced first, and ego planning is done separately, decoupled from the forecasting step. The new approach takes a hypothetical ego trajectory and integrates this information into the multi-agent forecasting. Such a coupled approach was used in [70], taking as input the planned future motion of the ego vehicle and the past trajectories of the agents in its proximity. The output is a distribution over likely trajectories. An LSTM encoder with social pooling captures the temporal structure, and the abstraction of interactions is done in a fusion module based on a CNN architecture. The same group [77] suggested a different model with a top layer enforcing explicit kinematic constraints by working in a curvilinear frame. Using a 1D CNN for temporal encoding and several LSTMs as well as attention for interaction modeling, their model is robust to missing information due to occluded objects. The remaining problem with such approaches is real-time performance, since the integrated task requires a lot of computation.

A completely different approach, which we do not discuss in detail but nevertheless mention, is imitation learning. The idea is that a (driving) policy is learned that imitates the actions of an expert. The interested reader may refer to, e.g., [126,127].

#### **5. Public Datasets**

First, a quick overview of the publicly available datasets most frequently used for training and evaluating scene prediction models is given. The purpose of training data in the context of learning-based models is obvious. After training (and validation), a separate test dataset is needed to evaluate the trained models. This enables advanced analysis, continuous development, and comparison to competitive and baseline approaches. Datasets are therefore an essential piece of the entire development pipeline for designing and engineering scene prediction models.

The Next Generation SIMulation (NGSIM) dataset [128] was recorded in California on the highways US 101 and I-80. Data collection was done from the tops of nearby buildings with a set of synchronized cameras at a sampling rate of 10 Hz. Thus, the exact location of each object is known.

The Karlsruhe Institute of Technology and Toyota Technological Institute (KITTI) dataset [129] was recorded in Germany on rural roads, highways, and in inner cities. It contains several hours of traffic scenarios, taken with stereo cameras and a 3D laser scanner mounted on a measurement vehicle.

The Argoverse dataset [130] was recorded in the United States, in Pittsburgh and Miami. It contains more than 30,000 scenarios sampled at a rate of 10 Hz with a 360° camera and Lidar, which were aligned with map information. The data is already split into training, validation, and test sets, each with a prediction horizon of 3 s.

The nuScenes dataset [131] was recorded in Boston and Singapore (left- and right-hand traffic) with a sensor system of 6 cameras, 5 radars, 1 Lidar, an IMU, and GPS. Furthermore, it provides detailed map information. Each of the 1000 recorded scenes has a length of 20 s. It contains manually labeled 3D bounding boxes for several object classes, with additional annotations regarding the scene conditions as well as human activity and pose information. This dataset is thus useful for urban use cases.

#### **6. Discussion**

The dominance of deep learning models for fully self-consistent scene prediction is obvious when comparing the methods in Table 1. Only a few approaches based on models other than neural networks can be found among the papers with the highest impact in the field. Note that it is not possible to quantitatively compare model results which have been evaluated only on private datasets [59–64]. Especially the very low root mean squared error of 0.45 m in the paper of Houenou et al. [60] is difficult to assess. The model performance depends heavily on the scenarios that are evaluated: straight highway driving with little traffic is a simple task that should be tackled well by any model, whereas curvy roads with many traffic participants performing multiple maneuvers are very challenging.

For the standardized datasets in Table 1, the models all use LSTM-, CNN-, or attention-based architectures. Here too, due to the different time horizons, only a qualitative performance estimate is given. Predefined test datasets exist only for Argoverse and nuScenes, which are thus the only datasets allowing a quantitative comparison of model performance. An intra-dataset comparison shows that the attention networks give especially good performance [69,75,76].

Please note that at the time of submission, the Argoverse leaderboard [132] reports a minimum average displacement error of 0.7897 m. Therefore, the result of minADE = 0.73 m in [76] is questionable.

It is also interesting to see that the same model sometimes performs well on one dataset but rather poorly on another: for Tang et al. [69], the RMSE of 3.80 m on the NGSIM dataset is among the top results, while the minADE of 1.40 m on the Argoverse dataset is just an average result.

One observes a competition in model improvement which by now takes place on the centimeter scale. However, it is questionable whether time and resources are invested in the right place, since such performance improvements might not make a difference when deploying a scene prediction model in the final AD architecture. It is more important to understand the differences in model performance across datasets. The distribution of cases can differ: one dataset may contain many situations of straight highway driving with little traffic, while another contains dense traffic with many corner cases and unpredictable situations. It could, however, also be an issue of the model itself being trimmed to a latent feature of a specific dataset. More effort should therefore be spent on this question.

#### **7. Conclusions and Outlook**

It is difficult to estimate the performance of some earlier model results since they have often been evaluated on private datasets. The model performance depends heavily on the composition of scenarios. Maneuvers and scenarios are not standardised; thus, datasets with long intervals of straight highway driving will yield better performance than highly crowded or curvy scenarios. Future research will benefit if methods are evaluated on public datasets, whose number will hopefully grow, since they facilitate inter-model comparisons.

Often a significant difference in the displacement error can be observed when comparing the model performance on two different datasets. This poses questions regarding the generalisation of models: if a model was developed on a public dataset, how will it perform in a different environment or with a new sensor model? This should be tackled by increasing the amount of data on which a model is tested. Realistic simulations can support testing, but there one faces the challenging task of modeling a realistic environment.

A further issue, which is seldom addressed in the referenced publications, is online learning. Updating a model on the fly can be dangerous since the model cannot be tested rigorously. Nevertheless, an algorithm should run in parallel which detects anomalies for further analysis and collects data for updating the model in a post-processing step. This way, corner cases can be identified. Anomaly detection is furthermore necessary to protect against adversarial attacks [60]. Guaranteeing a safe system is one of the most central research questions these days. Every day, huge amounts of data are collected which can be used for model updates. Even large cloud storages will reach their limits. Thus, it is important to select data for storage wisely, most efficiently in a preprocessing step already in the vehicle, before sending the data to the cloud.

In the context of efficient data usage, research opportunities are to be found in strategies such as transfer learning and active learning. Transfer learning makes use of already well-performing models from a different domain, which can be fine-tuned for the specific task. It was shown initially for computer vision applications that this approach leads to remarkable performance [133]. Active learning is an incremental learning process, which is especially useful if data is sparse. The data is rated by a trained model concerning its information content, and the most informative samples are then used for model updates [134].

Scene prediction models based on neural networks often use a combination of many different architectures, i.e., a graphical representation, recurrent models, and attention, as well as convolutions for semantic information. For urban use cases, map information is often integrated. The observed trend towards a complete scene prediction is not far from the disruptive approach. The argument against the disruptive approach was that such an algorithm cannot be tested for functional safety. Thus, it is necessary to think about the degree of complexity that is still manageable for testing.

When scene prediction models are robust enough, the next step will be the industrialisation of the product for a broad target market. The currently required sensor equipment is cost-intensive, using HD (high-definition) maps, camera, radar, Lidar, GPS, and more, and can often fill up the entire trunk of a car. Strategies for minimizing cost and material require new approaches such as the discretization of neural networks for deployment in embedded systems [135,136]. Alternative strategies use only a subset of the sensors, e.g., neglecting the costly Lidar technology, while still aiming at a comparable perception performance. The product costs eventually determine the target group of the self-driving vehicle: if it is not affordable for persons with a regular income, the business model will address high-value customers and mobility as a service.

A further stream in the deep learning community is the development of a general artificial intelligence. The so-called weak artificial intelligence is limited to specific tasks, whereas a strong or general AI shall be able to handle multiple tasks, similar to a human [137]. Fitting into this picture, the latest approaches focus on a coupling of prediction and planning, as presented in the previous section. For such models, the functional safety aspect can only be satisfied with a multi-stage approach: intermediate results need to be extracted from the complex model which can be interpreted and evaluated, such as the optimization of implicit layers [138].

**Author Contributions:** Conceptualization, A.S.N. and M.K.; writing—original draft preparation, A.S.N., M.K., M.S. and T.B.; writing—review and editing, A.S.N., M.K. and M.S.; visualization, A.S.N.; supervision, A.S.N. and T.B. All authors have read and agreed to the published version of the manuscript.

**Funding:** This research received no external funding.

**Data Availability Statement:** Not applicable.

**Acknowledgments:** Personal acknowledgements from A.S.N. to Reinhard Schlickeiser: This paper is written on the occasion of Reinhard's 70th birthday, who has been a great supervisor during my (Anne Stockem Novo's) early career. As a PhD student in Reinhard's group, he taught me the joy of analytical calculations, wrangling equations and function approximations. Because of his excellent scientific network, I came into contact with highly recognized researchers from the computer simulation domain, through whom I had my first contact with programming. The analytical mindset and critical thinking that I developed under Reinhard's supervision were the basis for a new challenge in industry at the ZF group, in algorithm development for automated driving with machine learning techniques, a field not directly connected to my previous research. This paved the way for becoming a professor for applied artificial intelligence.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **Abbreviations**


#### **Appendix A. Long Short-Term Memory Networks**

There are several gates in which the information is processed. The *forget gate* determines which information shall be kept in memory by adapting the weight matrices $W_f$, $U_f$ and the bias $B_f$,

$$f_t = \sigma \left[ W_f X_t + U_f h_{t-1} + B_f \right]. \tag{A1}$$

The non-linear activation function is usually a sigmoid function, $\sigma(x) = 1/(1 + \exp(-x))$, limiting the output $f_t$ to the range $[0, 1]$.

Similarly, the *input gate* determines which information shall be added to the long-term memory,

$$i_t = \sigma \left[ W_i X_t + U_i h_{t-1} + B_i \right] \tag{A2}$$

with the corresponding weight matrices $W_i$ and $U_i$, as well as the bias $B_i$.

The *output gate* receives similar input,

$$o_t = \sigma \left[ W_o X_t + U_o h_{t-1} + B_o \right]. \tag{A3}$$

One can also identify a simple RNN cell, which usually applies tanh as the activation function,

$$\tilde{c}_t = \tanh \left[ W_c X_t + U_c h_{t-1} + B_c \right]. \tag{A4}$$

The long-term memory is then calculated as

$$c_t = f_t \cdot c_{t-1} + i_t \cdot \tilde{c}_t, \tag{A5}$$

a combination of the forget gate, $f_t$, and the input gate, $i_t$, as well as the long-term state vector of the previous time step, $c_{t-1}$, and the output of the simple RNN cell at the current time step, $\tilde{c}_t$.

The short-term memory and cell output are updated with the output gate, $o_t$, and the long-term state vector, $c_t$,

$$h_t = o_t \cdot \tanh(c_t). \tag{A6}$$

Additionally, in this case, the output to the following layer, $y_t$, is identical to the state vector, $h_t$.
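For reference, the following NumPy sketch implements Equations (A1)–(A6) for a single time step; the parameter-dictionary layout is our own convention.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(X_t, h_prev, c_prev, p):
    """One LSTM step implementing Equations (A1)-(A6); p holds the
    weight matrices W_*, U_* and biases B_* for the individual gates."""
    f = sigmoid(p['Wf'] @ X_t + p['Uf'] @ h_prev + p['Bf'])        # forget gate (A1)
    i = sigmoid(p['Wi'] @ X_t + p['Ui'] @ h_prev + p['Bi'])        # input gate  (A2)
    o = sigmoid(p['Wo'] @ X_t + p['Uo'] @ h_prev + p['Bo'])        # output gate (A3)
    c_tilde = np.tanh(p['Wc'] @ X_t + p['Uc'] @ h_prev + p['Bc'])  # RNN cell    (A4)
    c = f * c_prev + i * c_tilde                                   # long-term memory  (A5)
    h = o * np.tanh(c)                                             # short-term memory (A6)
    return h, c
```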

#### **Appendix B. Gated Recurrent Unit**

The GRU is a simpler form with only two gates. The original GRU architecture [31] consists of a *reset gate*,

$$r_t = \sigma \left[ W_r X_t + U_r h_{t-1} + B_r \right] \tag{A7}$$

with weight matrices $W_r$ and $U_r$, as well as an *update gate*,

$$z_t = \sigma \left[ W_z X_t + U_z h_{t-1} + B_z \right] \tag{A8}$$

with weight matrices $W_z$ and $U_z$. The hidden state vector is calculated as

$$h_t = z_t \cdot h_{t-1} + (1 - z_t) \cdot \tilde{h}_t \tag{A9}$$

with

$$\tilde{h}_t = \tanh \left[ W_h X_t + U_h (r_t \cdot h_{t-1}) + B_h \right]. \tag{A10}$$
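Analogously, a single GRU step according to Equations (A7)–(A10) can be sketched as follows, using the same conventions as the LSTM sketch above.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(X_t, h_prev, p):
    """One GRU step implementing Equations (A7)-(A10)."""
    r = sigmoid(p['Wr'] @ X_t + p['Ur'] @ h_prev + p['Br'])              # reset gate  (A7)
    z = sigmoid(p['Wz'] @ X_t + p['Uz'] @ h_prev + p['Bz'])              # update gate (A8)
    h_tilde = np.tanh(p['Wh'] @ X_t + p['Uh'] @ (r * h_prev) + p['Bh'])  # candidate   (A10)
    return z * h_prev + (1.0 - z) * h_tilde                              # new hidden state (A9)
```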

#### **References**

