*Review* **A Review on Scene Prediction for Automated Driving**

**Anne Stockem Novo <sup>1,2,</sup>\*, Martin Krüger <sup>3,4</sup>, Marco Stolpe <sup>4</sup> and Torsten Bertram <sup>3</sup>**


**Abstract:** Towards the aim of mastering level 5, a fully automated vehicle needs to be equipped with sensors for a 360° surround perception of the environment. In addition, it is required to anticipate plausible evolutions of the traffic scene such that it is possible to act in time, not just to react in case of emergencies. This way, a safe and smooth driving experience can be guaranteed. The complex spatio-temporal dependencies and high dynamics are some of the biggest challenges for scene prediction. The subtle indications of other drivers' intentions, which are often intuitively clear to the human driver, require data-driven models such as deep learning techniques. When dealing with uncertainties and making decisions based on noisy or sparse data, deep learning models also show a very robust performance. In this survey, a detailed overview of scene prediction models is presented with a historical approach. A quantitative comparison of the model results reveals the dominance of deep learning methods in current state-of-the-art research in this area, leading to a competition on the centimeter scale. Moreover, it also shows the problem of inter-model comparison, as many publications do not use standardized test sets. However, it is questionable whether such improvements on the centimeter scale are actually necessary. More effort should be spent on trying to understand varying model performances, identifying whether the difference lies in the datasets (many simple situations versus many corner cases) or is actually an issue of the model itself.

**Keywords:** automated driving; data-driven modeling; deep learning; scene prediction; trajectory prediction

## **1. Introduction**

In the last five years, huge progress has been made in the technology of self-driving cars. Advanced Driver Assistance Systems (ADAS) have become very mature and are part of any new vehicle nowadays. First functions beyond the ADAS level, e.g., lane keeping or lane change assistance, were introduced into the series market by conventional Original Equipment Manufacturers (OEMs). Even more compelling were the advances in higher automated driving levels made by Tesla with the autopilot and full self-driving capability functions [1], even though, contrary to what the features' names suggest, the driver is still responsible and needs to monitor the driving process at all times.

The Society of Automotive Engineers (SAE) defines six levels of Automated Driving (AD) [2] (see Figure 1). ADAS are limited to levels 1 and 2, where the human driver is still the one operating the vehicle. Although the system can take full control already for specific use cases at level 2, the human driver needs to be attentive all the time, meaning eyes on the traffic at any time. It allows for temporary hands-off driving, but the system reminds the driver to take back control after a short time period.


**Figure 1.** Automated driving (AD) levels according to the definition of the Society of Automotive Engineers.

The actual AD use case starts from level 3. The challenging task in designing a level 3 system is that a human driver, who can go hands- and eyes-off, must be notified in time to take back control within a period of around 10 s. Bridging the time span of 10 s in case of a complex situation is such a great challenge that some OEMs are discussing the strategy of skipping this level and going straight to the high automation level 4. Here, in contrast to the reaction times of human drivers, the vehicle can handle problems on its own within just fractions of a second, or go to a fail-safe condition if no other solution can be found. The final level 5 can handle all kinds of situations and use cases. A steering wheel might not be present anymore.

In 2019, the first autonomous taxi fleets were announced for the year 2020 [3]. However, this was delayed, partially due to legislation and partially due to technical issues. Only slowly have the first test cases been introduced. While the use cases of automated highway driving or level 4 driving on well-defined urban test fields can already be handled quite well, the big challenge is handling fully automated level 5 driving for all possible scenarios and corner cases. This includes strongly populated cities, where it is not possible to anticipate the traffic scene reliably for more than 1–2 s, or situations that occur very seldom.

The advantages of AD are numerous and quite obvious: an increase in comfort for the passengers, along with more flexibility for young, old or disabled persons. Furthermore, AD will reduce the number of accidents. A simulation study of fatal crashes predicted a reduction of collisions by 82% [4]. Naturally, machines will also fail, but the number and severity of accidents is expected to be lower compared to human drivers. This aspect is followed by a number of ethical questions, e.g., how to decide which action to take in case of an unavoidable crash? This is still a question under discussion. Moreover, human failures are better accepted than accidents caused by a machine. A further advantage to mention is that driverless systems can be optimized in economic aspects. Car sharing concepts are being designed which will reduce the number of required vehicles through efficient usage, and fewer parking spaces will be needed. Traffic jams often occur due to inattentive drivers and unnecessary braking maneuvers [5]. This can easily be avoided in driverless traffic with the help of car-to-car communication [6].

While trying to reach the goal of fully automated vehicles, it is important to keep in mind the demands of the industry. First, AD vehicles need to be equipped with multiple sensors. However, the costs need to be low enough that either individual rides with a robotaxi service or the purchase of an actual automated vehicle is affordable, which obviously addresses people of different income classes. Second, current systems often require high-performing computer hardware. The aim is to move towards embedded systems, which are drastically limited in run-time and memory. Last but not least, an AD system has to be robust and secure, meaning that faulty sensors or adversarial attacks [7,8] need to be handled reliably. In order to get permission to drive on public roads, a company needs to undergo a formal procedure that guarantees the functional safety of the system. This latter topic is not trivial and is still a road blocker for bringing some technical applications to the market.

Despite the recent advances in AD, there still exist challenges which are some of the main reasons why fully automated cars cannot be brought onto the streets yet. One of the most challenging problems, focused on in this review and a field of increasingly active research, is the prediction of the future behavior of traffic participants. The intentions that lead to certain maneuvers or actions are not always obvious but rather hidden as latent factors. Being able to identify such intentions several seconds before the event makes it possible to comfortably adjust the ego vehicle's own driving trajectory.

A second major challenge is the handling of uncertainties. The perception of the environment is associated with uncertainties, especially when fusing information from different sources. On the one hand, multiple signals provide redundancy and thus more reliability; on the other hand, they create higher uncertainties if the individual signals do not match. Based on these signals, robust environment models need to be developed for the planning of the ego vehicle trajectory. The problem is highly complex and dynamic due to unpredictable external factors, e.g., changing weather conditions, varying environments possibly without road markings, or sudden movements of pedestrians and cyclists. Deep learning techniques are quite robust against these uncertainties.

Scene prediction is a field of active research with increasing interest, as can be seen from Figure 2. In the following, this review is delimited from previous surveys. Surveys on AD often focus on perception issues and challenges related to the fusion of different sources [9–11] or on psychological aspects related to the interaction between the human driver and the machine [12–14]. Some earlier surveys of scene prediction give a good overview of the different approaches but do not yet consider advanced deep learning techniques [15,16]. The review by Xue et al. [17] focuses on scene understanding, event reasoning and intention prediction, distinguishing between short-term predictions (time horizon of a few seconds) and long-term predictions (time horizon of several minutes), but is kept on a rather high level in terms of methodology.

**Figure 2.** Number of papers per year addressing scene prediction (based on [18]).

Scene prediction often takes over new deep learning techniques from the field of natural language processing, a fast-moving research area where huge progress has been made in the last three years. Naturally, those latest techniques are contained only in the most recent reviews. Refs. [19,20] highlight the importance of deep learning models but put emphasis on visual information as the input source, such as feature extraction from 3-dimensional (3D) images and videos. In the studies by Yin et al. [21] and Tedjopurnomo et al. [22], deep learning techniques are presented in detail, but the evaluation is done for traffic flow models. Such models have a time horizon of several minutes, in contrast to the models in the paper at hand, which operate on a horizon of up to 10 s.

In contrast to the paper by Rasouli [23], where a detailed overview of state-of-the-art deep learning techniques is also presented, here, the approach of a historical review is chosen by analyzing the development of the methodology in this field of research. A comparison of the model results reveals the dominance of neural networks in current state-of-the-art methods. We experience the difficulty of quantitative inter-model comparison noted in [23], since many models are not evaluated on standardized test datasets. However, the value of a competition over performance improvements on the centimeter scale is questioned here. The focus should rather be on understanding the reasons for significantly varying performances on different datasets, which has not been addressed in detail yet.

The review starts with a concise introduction to AD in Section 2 with the aim of placing scene prediction in the context of AD and describing the challenges. In Section 3, the standard models for scene prediction are presented. The historical context and major achievements in scene prediction are presented in Section 4, followed by a short presentation of publicly available datasets in Section 5. The model comparison is discussed in Section 6. Finally, conclusions are drawn and future challenges are addressed in Section 7.

## **2. The Context of Scene Prediction for Automated Driving**

#### *2.1. Sensors*

The foundation of a well-performing AD system is a robust perception model. This can only be achieved with a redundant sensor setup, e.g., the one shown in Figure 3.

**Figure 3.** Example sensor setup for a self-driving vehicle.

Besides standard sensors such as an Inertial Measurement Unit (IMU) for measuring acceleration and a Global Positioning System (GPS), AD vehicles are typically equipped with a camera system, radars, and Lidars, each providing a 360° surround view in the best case. The variety of sensors takes advantage of the different physical properties but also brings redundancy to the system.

The optical camera is usually good for classification tasks such as distinguishing the type of a road user or recognizing lane markers or traffic signs. While its performance in measuring distances and velocities is rather weak, this information can be retrieved well from radars. Lidars are complementary to the other two sensors, showing only a few weaknesses. Distances and velocities can be estimated with very high accuracy. The main disadvantage is the high cost of a Lidar system.

Furthermore, high-definition maps as an additional information source are currently being integrated into level 2+ systems, providing a spatial resolution of a few centimeters. This information is especially important for the urban environment. The detailed road infrastructure, such as lane marking and shape or traffic lights and signs, is collected in huge data collection campaigns and is then combined with the above-mentioned sensor output.

#### *2.2. Evolutionary versus Revolutionary Approach*

According to the above definition of AD levels, current series market technology has reached level 2+. This is an intermediate level between 2 and 3, which has been introduced to indicate that level 2 has technically already been exceeded. Due to some technical and legal aspects, which are also discussed in this paper, the transition to level 3 is not trivial and has not been accomplished yet. There are basically two strategies for progressing to higher automation levels. Conventional OEMs and Tier 1 suppliers follow a conventional bottom-up approach, which is often also called an evolutionary approach. It consists of a building block architecture, as shown in Figure 4. The task of *automated driving* is broken down into three main components: perception of the environment, decision making, with the sub-task *scene prediction* located in the module *behavior planning*, and acting. Each of these components is further broken down into smaller modules so that an individual module itself can be tested in terms of functional safety and quality.

**Figure 4.** Simplified high-level architecture for self-driving vehicles following the conventional approach of most OEMs and Tier1s. See text for details.

Companies such as Waymo or Zoox follow a fundamentally different approach, often called the disruptive or revolutionary approach. Taking the human as a potential operator of the vehicle out of the loop allows for entirely different vehicle concepts. For instance, in a level 5 vehicle there is no need for driving control input devices such as a steering wheel or pedals. Therefore, an electric vehicle could become direction-independent, driving forward and backward equally well. A separate scene prediction module or sub-module may not be required in this approach anymore.

#### *2.3. Scene Prediction and Its Challenges*

The goal of scene prediction is to anticipate how a traffic scene will evolve within the next seconds. All relevant agents contributing to the scene are described by their states (position, velocity and heading angle), which shall be predicted with the highest possible accuracy. A relevant agent or object is one that influences the trajectory of the ego vehicle within the considered time frame. For SAE level 3, the aim is to reach a prediction horizon in the order of 10 s. Based on the prediction, the ego trajectory can be planned and maneuvers can be executed.

Among the biggest challenges in scene prediction are the complex spatio-temporal dependencies and high dynamics. In addition, an action of a traffic participant affects all surrounding participants. These subtle indications are not obvious and thus not possible to describe with simple physical models. Data-driven techniques, and especially deep learning models, have the potential to make these predictions and to meet the challenges. Simple data-driven models identify the most common trajectories in historical data, deriving typical maneuver classes. More advanced and deeper models are capable of identifying typical patterns self-consistently, often generating the most likely trajectories associated with a probability. The requirement is that the scene prediction should anticipate intended actions of others in the scene some time ahead, at least as well as or better than a human driver. This also means, however, that one needs to accept the existence of unpredictable situations, such as sudden decisions, which neither a human driver nor an automated scene prediction could ever anticipate.

The uncertainties from perception, which have already been mentioned as an AD challenge, are also a determining factor for the performance of a scene prediction model. In order to understand the current situation, the system has to make the right decisions, often based on noisy or sparse data. For this aspect, too, deep learning models show the best performance compared to other approaches due to their ability to generalize.

## **3. Methods for Scene Prediction**

In this section, the most widely applied methods for scene prediction are presented. These models are generic, not specific to scene prediction or AD, and can be divided into model-driven and data-driven approaches. The former have a rather small prediction horizon, which is why they only play a minor role these days, but they are mentioned for completeness.

#### *3.1. Model-Driven Approaches for Scene Prediction*

#### 3.1.1. Kinematic Models

The simplest approach for extrapolating the trajectory of a traffic participant is to consider purely the kinematics of an object. The object's trajectory $\mathbf{x}(t)$ in the plane, with $\mathbf{x} \in \mathbb{R}^2$ and $t$ denoting time, is then simply described by:

$$\mathbf{x}(t) = \mathbf{x}(0) + \mathbf{v}t + \frac{1}{2}\mathbf{a}t^2 \tag{1}$$

with **v** being the velocity and **a** the acceleration. Often, the assumption of constant velocity or constant acceleration is made, which works well for highway situations with little traffic, but cannot take into account more dynamic situations which involve interactions among the traffic participants. A more sophisticated approach is to apply a transformation into a curvilinear coordinate system, resulting in a more consistent driving behavior.

#### 3.1.2. Dynamic Models

More advanced models also take into account the different forces acting on an object. The models usually start from the action,

$$S(\mathbf{q}) = \int_{t_1}^{t_2} L(t, \mathbf{q}(t), \dot{\mathbf{q}}(t)) \, \mathrm{d}t, \tag{2}$$

where $L(t, \mathbf{q}, \dot{\mathbf{q}})$ is the Lagrangian, depending on the time $t$, the generalized coordinates $\mathbf{q}$ and their time derivatives $\dot{\mathbf{q}}$. In the context of curvy roads, the jerk $\mathbf{j}(t) = \dot{\mathbf{a}}(t)$ is especially important for vehicle dynamics.

Such models can become quite complicated, while still describing only a small time horizon, which is why they receive only minor attention in the context of scene prediction.

#### 3.1.3. Adding Uncertainties to the Model

Uncertainties in the prediction of moving objects are commonly addressed by adding a Gaussian noise term, a normal distribution centered around the mean value. This approach is closely connected to the Kalman filter [24], which assigns a Gaussian noise profile to the sensor measurement. The uncertainty is then propagated iteratively in a step-wise calculation of the estimated trajectory,

$$\mathbf{x}(t+\Delta t) = \Phi(t+\Delta t; t)\,\mathbf{x}(t) + \mathbf{u}(t), \tag{3}$$

where Φ is the transition matrix and **u** a Gaussian noise term with expectation value zero.

Monte Carlo simulations are a further method for considering uncertainties in model-driven approaches [25]. Starting from a set of input variables, the vehicle trajectories are modeled based on physical assumptions and sampling of dynamic variables. This technique also allows one to introduce constraints, e.g., due to road boundaries.

#### *3.2. Data-Driven Approaches for Scene Prediction*

In data-driven development, modeling is only possible with a large database of recorded or simulated traffic situations. The first step in model building is feature extraction, which covers the identification of relevant data and data preprocessing. These data are then used for deriving a generalized model.

#### 3.2.1. Classic Methods

While deep learning models have become state-of-the-art for traffic modelling, let us first give a brief overview of classic data-driven approaches.

#### Hidden Markov Models

Markov processes are statistical models that describe the probability or likelihood of observation sequences [26]. Given a system with $N$ distinct states, $S_1, \ldots, S_N$, at any time $t$, the probability of a transition between two states is:

$$a_{ij} = P[q_t = S_j \mid q_{t-1} = S_i] \qquad \text{for } 1 \le i, j \le N, \tag{4}$$

with $q_t$ being the observable state at time $t$. The transition probabilities can be extracted from a set of data.

Hidden Markov Models (HMMs) [26] are a further extension of Markov processes for non-observable, underlying states. Equation (4) applies also in this case, with the important difference of non-observable states, $S = S_1, \ldots, S_N$, and $M$ distinct observations, $V = v_1, \ldots, v_M$. The distribution of the observations is then given by:

$$p_j(a) = P[v_a \text{ at } t \mid q_t = S_j] \qquad \text{for } 1 \le j \le N, \ 1 \le a \le M. \tag{5}$$

The aim is to fit the above parameters such that the observed sequence of states is correctly modeled. The transition probabilities, as well as the relationship between the observed states and the non-observable events, are learned from data.

In the context of scene prediction, this corresponds to predicting consecutive traffic maneuvers. The drawback of this method is that it considers distinct maneuvers and does not take into account interactions of traffic participants [26].

#### Regression Models

For a given dataset, regression models find a continuous function to describe the relationship between independent and dependent variables. Different types of regression models exist, and the choice depends on the problem formulation. Polynomial regression models work well for trend forecasting. Logistic regression outputs values between 0 and 1 and is therefore good for classification tasks. Bayesian regression assumes that the data are described by a normal distribution and estimates the posterior probability based on a prior distribution.

#### 3.2.2. Neural Networks

The output of a neural network is either a classification or a regression. By analogy with biological neurons in the brain, the model consists of artificial neurons (nodes) and connections (edges), which are stacked in layers. The impact of information transmitted from one neuron to another is modeled by a weighted connection.

#### Feed-Forward Neural Network

In a simple Feed-Forward Neural Network (FFNN), the information is passed from one layer to the next by a simple weighted summation of the input, see Figure 5. The output at any neuron $k$ in the higher layer is given by $y_k = \sigma\left[\sum_{i=0}^{N} W_{ik} X_i + B_k\right]$ with a non-linear activation function $\sigma$, input $X = (X_0, \ldots, X_N)$, weight matrix $W \in \mathbb{R}^{N \times m}$, and bias $B \in \mathbb{R}^m$. The compact form can be written as:

$$y = \sigma \left[ W^T X + B \right] \tag{6}$$

with $y = (y_0, \ldots, y_m)$ and $B = (B_0, \ldots, B_m)$. The weights and bias terms are learned in an iterative process via the backpropagation algorithm [27].

**Figure 5.** A simple feed-forward neural network. See text for details.

One shortcoming of feed-forward neural networks in the context of scene prediction is that they do not directly model temporal dependencies. Instead, temporal information has to be manually encoded into the input representation, for which many possible encodings exist. Finding a good representation then becomes a complicated task in itself.

#### Recurrent Neural Network

For time series prediction tasks, a more sophisticated network structure is generally used. In Recurrent Neural Networks (RNNs), the information from previous time steps is stored in so-called memory cells, i.e., feedback loops that memorize the information in hidden state vectors. This is shown in Figure 6 for a simple RNN model with a feedback loop.

**Figure 6.** A simple recurrent neural network. See text for details.

The additional state vector,

$$h_t = \sigma \left[ W X_t + U h_{t-1} + B \right], \tag{7}$$

serves for memorizing information. The matrix $U$ is a further parameter that needs to be learned during the training process. The hidden state $h_t$ and the output $y_t$ are identical in this case.

#### Long Short-Term Memory

Simple RNN models provide the benefit of storing past information, but they show some major problems. The stored information is limited to short sequences only, and they are hard to train since they suffer from vanishing and exploding gradients [28,29]. A more robust RNN architecture that almost completely circumvents these problems is the Long Short-Term Memory neural network (LSTM) [30]. This neural network architecture is capable of memorizing hundreds of time steps. The problem of vanishing and exploding gradients is solved by the introduction of three so-called gates (input, forget, and output gate) controlling the flow of information through the LSTM layer, and by a truncation of the gradients in the learning algorithm.

The architecture of an LSTM is shown in Figure 7. Similar to the simple RNN, one input to an LSTM cell is the input from the previous layer at the current time step $t$, $X_t$. Additionally, a second input is the hidden state vector from the previous time step, $h_{t-1}$. This state vector stores the short-term information. On the other hand, an additional cell state $c_{t-1}$ is introduced, which stores the long-term information. The output of the LSTM cell is again $y_t$ as well as the hidden and cell state vectors, $h_t$ and $c_t$. Note that only $y_t$, which is identical to $h_t$, is passed to the next layer. The details of the computation are given in Appendix A.

**Figure 7.** A long short-term memory neural network. See text for details.

There exist several variations of the LSTM architecture, e.g., the Gated Recurrent Unit (GRU), a simpler form with only two gates. The number of trainable parameters is reduced; thus, it is very efficient and performs especially well on smaller data sets. The original GRU architecture [31] consists of a reset gate and an update gate. More details are given in Appendix B.

#### 3.2.3. Encoder–Decoder Models and Attention Mechanism

Natural language processing is one of the hottest topics in machine learning these days. It was found that those models also perform well in other fields, such as scene prediction. One major improvement is the so-called attention mechanism. It is based on an encoder–decoder architecture, a sequence-to-sequence model.

A simple encoder–decoder model is shown in Figure 8, with the encoder and decoder each consisting of a single recurrent unit. The encoder takes as input a time series $X$ of length $M$, for which the hidden states are calculated. Only the last hidden state, $h_M$, is passed to the decoder, where it serves as the initial state vector. The decoder takes as input the state vector, $h_M$, as well as one item at a time of the labeled output sequence, such that it can predict the next item in the sequence based on the previous output, i.e., $\langle\text{start}\rangle \to y_0$, $y_0 \to y_1, \ldots, y_L \to \langle\text{end}\rangle$.

**Figure 8.** A simple encoder–decoder model for time series. See text for details.

Attention models extend this architecture by outputting not only the final state vector *hM* but all hidden states which are then combined to a context vector [32]. The context vector is a weighted sum of the hidden states:

$$h' = \sum_{m=1}^{M} \alpha_m h_m \tag{8}$$

with the weights being normalized with a softmax function:

$$\alpha_m = \frac{\exp(b_m)}{\sum_{k=1}^{M} \exp(b_k)}. \tag{9}$$

$b_m$ is a trainable parameter which takes the context vector and the hidden states of the decoder as input. In this approach, the model gets information from all hidden states. It is trained to adjust the weights in a way that gives higher value, or attention, to more important parts of the input sequence.

#### Variational Autoencoder

An autoencoder is a system of encoder and decoder which is used in an unsupervised learning fashion. When data is fed into the encoder, it learns an abstraction into latent space by dimensionality reduction. The decoder receives the output from the encoder and is then trained to reconstruct the original data. The original data thus serves as the ground truth label during the training process, with the goal of minimizing the reconstruction error.

The problem with the simple autoencoder is that it tends to overfit the latent representation if no regularization methods are used. Therefore, during encoding into latent space, not just a single point is used but a probability distribution with mean and standard deviation, often a Gaussian distribution, resulting in multiple different model outcomes. This extension of the simple autoencoder is called a variational autoencoder [33]. It became very popular in computer vision for the generation of virtual images, especially in the context of StyleGAN2 [34]. The idea is to use the model reconstruction for generating new data samples.

A conditional variational autoencoder [33] learns a distribution of a so-called *latent variable* $z$. One part of the training objective, the Evidence Lower Bound (ELBO), is the Kullback–Leibler divergence between the prior and the posterior distribution of the latent variable (the second part of the loss function is the data log-likelihood). For trajectory prediction, the prior is usually conditioned only on the observation period, while the posterior is conditioned on the observation period and the ground-truth future trajectory. During training, the model is supposed to learn to approximate the posterior distribution with the prior distribution, as the information about the ground-truth future trajectory is not available during inference. Instead, it is actually the task of the model to predict this future trajectory. During inference, the prior is used to sample an actual value $z$ from the latent variable distribution. This sample is then used to condition the trajectory prediction on. Repeated sampling of $z$ allows one to generate an entire probability distribution for the prediction. Therefore, such a distribution can capture all the initially addressed issues of uncertainty, ambiguity, and multi-modality due to the variety of those trajectory samples. Often, the conditional variational autoencoder framework is combined with an RNN-based encoder–decoder architecture.

#### Convolutional Neural Network Models

Time series can also be handled quite well with 1D Convolutional Neural Networks (CNNs). As the name suggests, 1D CNNs use a 1D filter kernel in order to apply convolutions [35]. The discrete 1D convolution for a time series *X* = (*X*0, ... , *XN*) is then given by

$$\varphi_i = (X * K)(i) = \sum_{m=0}^{M} X_{i - m \cdot s} \, k_m \tag{10}$$

with the 1D filter kernel $K = (k_0, \ldots, k_M)$ being a trainable parameter of the network. The length $M$ of the filter kernel is specified by the user and determines the length of the time sequence that is taken into account; likewise, the stride $s$ determines the frequency at which the input sequence is sampled. For combining spatio-temporal information, the model can be extended to higher dimensions. The biggest advantage in comparison to standard feed-forward neural networks is that the input representation, i.e., which features are important and should be extracted, can be learned here.

#### Transformer Models

Another powerful state-of-the-art neural network model from natural language processing is the transformer architecture [36]. Such models consist of several stacked encoders and decoders, each constructed in the same way: a combination of *attention* and FFNN layers.

The input is passed sequentially along the stacked encoders and decoders, respectively, but also from each encoder to the corresponding decoder. The self-attention layers serve to put an item of a sequence into the entire context and can determine its importance for the entire sequence.

#### Graph Neural Network Models

An alternative representation of the data is used in Graph Neural Networks (GNNs), which have become popular recently [37]. The input data is presented in a graph structure, which reflects the structure of the system. This approach provides the benefit that only existing connections are considered, in contrast to a fixed array representation in the previous approaches. Therefore, CNNs can be considered as special cases of GNNs, where the local connectivity of all nodes is represented by convolutional filters. This approach is especially effective when dealing with sparse data.

A graph is made up of nodes *N* and edges *E* (see Figure 9):

$$G = (N, E). \tag{11}$$

The binary adjacency matrix *A* contains the relationships of the nodes, which can be either undirected (*A* is symmetric) or directed (*A* is not symmetric). Furthermore, each node is associated with a feature matrix *X*. The goal is to train a neural network that takes a representation of the graph structure as input, which is then being transformed into an embedding.

**Figure 9. Left:** Graph representation. **Right:** Aggregation of neighbors for node embedding. See text for details.

The key idea of GNNs is that the node embeddings are learned with neural networks. A popular approach is GraphSAGE [38], a graph convolution network (GCN), where all neighbors of a target node are identified and their contributions are aggregated into the target node. The squares in Figure 9 represent neural networks. The node messages are then calculated in the following way. The embedding in the 0-th layer is just the node feature itself,

$$h_v^0 = X_v, \tag{12}$$

with *v* referring to the *v*-th node. Embeddings in further layers are then calculated as

$$h_v^{l+1} = \sigma\left(W_{l+1} \cdot \mathrm{CONCAT}\left[h_v^{l}, h_{N(v)}^{l+1}\right]\right) \tag{13}$$

for $l \in \{0, \ldots, L-1\}$, with $L$ the number of embedding layers, $N(v)$ the neighborhood function, $W_{l+1}$ a weight matrix, and CONCAT the concatenation function [38]. Here, $h_{N(v)}^{l+1}$ denotes the aggregated embeddings of the neighbors of node $v$.

## **4. Historical Review of Relevant Work**

This section illustrates the evolution of the field of *scene prediction* from a historical point of view. Contributions with the highest impact are highlighted, starting from the beginnings of this field of research and ending with state-of-the-art research. The presented selection is based on the impact on the community, quantified by the number of citations in the context of AD.

As highlighted here, there has been a strong increase in the amount of research on trajectory prediction in the past five years. Additional expertise has been brought to vehicle trajectory prediction from the robotics, deep learning, and computer vision communities. Besides leading to great progress on the problem itself, the gathering of researchers and research groups in this field has led to a strong parallelization of the research, so that many similar approaches were developed around the same time, leading to clusters of certain methods. Compared to less intensively investigated research fields, this has also led to a less successive development in the research history of trajectory prediction.

Due to the great progress made in the last few years, one important question emerges: How accurate does trajectory prediction have to be (to enable a certain level of automation, according to the SAE definition)? Or, shorter: How accurate is accurate enough? While the exact answer to this question is out of the scope of this paper, and seems very hard to answer too, there is one important aspect regarding its core. Due to the continuing focus on certain datasets for development and the provided infrastructure and Application Programming Interfaces (APIs), trajectory prediction gradually becomes a research challenge. While the competition between individual researchers and research groups leads to more and more precise prediction models, improvements in the magnitude of centimeters may decide the order on the leader board, but it is an open question how much such improvements contribute to a better driving experience of an automated vehicle. Closing the loop, this is related to the challenge of a very active and dynamic research field and how to interpret and evaluate the results.

Less recent approaches are characterized by an additional problem, which is related to the data used. While the most recent approaches focus on only very few datasets for evaluation, older papers have worked on different datasets, making it hard to compare those approaches and their results against each other. All these thoughts motivated the structure of the upcoming section, the selection of the presented papers, and their evaluation and assessment.

The term "scene prediction", as used in this paper, is a fully self-consistent description for predicting all traffic participants in a certain area constituting one driving situation. Since this is a very complex task, initially only sub-tasks of scene prediction were focused on.

#### *4.1. Recognition of Other Drivers' Intentions*

First approaches towards the modeling of traffic behavior focused on specific use cases. The first highlighted use case is lane change prediction. Then, the use case of car-following, which focuses only on the longitudinal part of the trajectory, is briefly presented.

#### 4.1.1. Lane Change Prediction

The goal of lane change prediction is to identify the intention of other drivers to change the lane. It is possible to detect such an intention several seconds before the actual event, such that the automated ego vehicle can prepare to react to it.

The input to such a model is usually a feature vector **X** describing the state of a target vehicle over a history of discrete time steps. **X** contains the distance and relative velocity to the ego vehicle and to surrounding vehicles in the vicinity of the target vehicle. The output is a classification of the maneuver with the classes "lane change left", "lane change right" and "lane keeping".

The first step towards tackling the lane change use case was understanding the underlying decision process. In 1986, Gipps identified the three central items before making a lane change as (i) the physical possibility of changing the lane, (ii) the necessity, as well as (iii) the desirability, which were decided upon in a flowchart [39]. The describing terms were initially addressed by simple mathematical formulas, providing the fundamentals for a microscopic approach. Based on this study, Toledo et al. [40] refined the model by a probabilistic description of the utility terms.

Most of the following publications consider this decision process as a latent variable which is not directly observable and focus on classifying whether a lane change is going to happen. Hunt and Lyons [41] developed a simple neural network, which is a great method for classification tasks, achieving reasonable results on a simulation dataset. The problems encountered during model development were "little guidance [...] available on the selection of network architecture and the most appropriate paradigm" [41], which is the determining factor for the success of the outcome, as well as a worse performance on real data during inference.

A study [42] that received much attention analyzed the driver's behavior during the lane change process and found that the typical sine-wave steering pattern was accompanied by a slight deceleration before the actual maneuver. Surprisingly, only 50% of the events had an activated turn indicator signal, with 90% reaching into the lane change maneuver. By observing the eye movement, it was found that the driver takes the focus off the current driving lane approximately 5 s before the start of the lane change.

A typical approach for modeling the intention of the driver are HMMs, since the probabilistic step-wise state progression fits the execution of a lane change maneuver well. The probabilistic modeling of the state transitions allows to predict the maneuver approximately 1 s before it takes place [43,44]. Approaches based on Bayesian statistics, which are often combined with Gaussian mixture models, achieved similar performance [45–48]. Only few publications use physics-based models, e.g., describing the traffic flow as a continuous fluid [49]. More recent approaches rely on machine learning techniques such as support vector machines [50–52] or neural networks [16,53,54].

#### 4.1.2. Car-Following

A further use case is car-following, which is similar to the adaptive cruise control function and part of lane keeping. In order to keep a dynamic safety distance to a leading vehicle, a smooth acceleration model is required to avoid immediate reactions to any sudden acceleration or braking of the leading vehicle. Interestingly, the first models were developed for city traffic, which for scene prediction is a use case considered more difficult than highway driving.

The input to car-following models is usually a feature vector **X**, where spatial coordinates and velocities are one-dimensional and aligned with the driving direction. The output of this model is an acceleration or velocity for the ego vehicle.

Due to the reduced dimensionality of the problem, this use case is more often approached with physics-based models; see, e.g., [55,56]. Here, the study by Helbing and Tilch from 1998 [57], which received more than 5000 citations, is to be mentioned. The authors developed a generalized force model which uses the formalism of molecular dynamics for many-particle systems. Equations of motion describe the effective acceleration and deceleration forces. The so-called social forces acting on an agent try to reflect internal motivations. Context information such as speed limits, acceleration capabilities of vehicles, the average vehicle length, visibility, and reaction time were introduced as direct parameters. The advantage of this model is a good understanding of and insight into the model due to the easy interpretation of the parameters.

#### *4.2. Full Trajectory Prediction*

Trajectory prediction is so complex that (almost) all approaches are data-driven. The difficulty lies in the modeling of all factors determining the internal object states and perception. The input to the model is a feature vector **X** of the observed scene. **X** can be represented by trajectory coordinates (Figure 10, left), occupied grid cells on an occupancy map (Figure 10, right), semantic features, or a combination of those. These data are collected with sensors and can be raw or further processed signals. The output is a prediction of the positions and states of all traffic participants within a defined time horizon. The output can be a binary classification for an occupancy grid or a regression model outputting floating-point numbers for future trajectory points of the objects.

**Figure 10. Left**: Objects are represented by their position and velocity. **Right**: Objects are associated to a cell in an occupancy grid.

In Table 1, quantitative evaluation results of full trajectory prediction models are collected. If available, the Final Displacement Error (FDE), Average Displacement Error over the entire trajectory (ADE), Root Mean Squared Error (RMSE) or Mean Absolute Error (MAE) is given, with prediction horizons in brackets. For an evaluation of multiple modes, the number of samples *K* is also mentioned.


**Table 1.** Overview of publications on scene prediction for AD. Neural network models are grouped under "L" (LSTM/GRU), "C" (CNN), "G" (GNN), and "A" (Attention). Some papers are also available at [58]. For the error, the number of modes, *K*, over which the prediction is sampled is specified, if applicable. See abbreviations list.

<sup>1</sup> Only lateral position considered.
