Article

M2Tames: Interaction and Semantic Context Enhanced Pedestrian Trajectory Prediction

1 School of Computer and Artificial Intelligence, Zhengzhou University, Zhengzhou 450001, China
2 National Supercomputing Center in Zhengzhou, Zhengzhou 450001, China
3 School of Computer and Information Engineering, Henan University, Kaifeng 475000, China
* Author to whom correspondence should be addressed.
Appl. Sci. 2024, 14(18), 8497; https://doi.org/10.3390/app14188497
Submission received: 2 July 2024 / Revised: 11 September 2024 / Accepted: 16 September 2024 / Published: 20 September 2024

Abstract:
Pedestrian trajectory prediction is a crucial task for autonomous driving. Constructing effective pedestrian trajectory prediction models depends heavily on utilizing the motion characteristics of pedestrians, along with their interactions among themselves and with their environment. However, traditional trajectory prediction models often fall short of capturing complex real-world scenarios. To address these challenges, this paper proposes an enhanced pedestrian trajectory prediction model, M2Tames, which incorporates comprehensive motion, interaction, and semantic context factors. M2Tames provides an interaction module (IM), which consists of an improved multi-head mask temporal attention mechanism (M2Tea) and an Interaction Inference Module (I2). M2Tea thoroughly characterizes the historical trajectories and potential interactions, while I2 determines the precise interaction types. Then, IM adaptively aggregates useful neighbor features to generate a more accurate interactive feature map and feeds it into the final layer of the U-Net encoder to fuse with the encoder's output. Furthermore, by adopting the U-Net architecture, M2Tames can learn and interpret scene semantic information, enhancing its understanding of the spatial relationships between pedestrians and their surroundings. These innovations improve the accuracy and adaptability of the model for predicting pedestrian trajectories. Finally, M2Tames is evaluated on the SDD and ETH/UCY datasets under short- and long-term settings. On SDD, M2Tames outperforms the state-of-the-art model MSRL by 2.49% (ADE) and 8.77% (FDE) in the short-term setting and surpasses the best-performing Y-Net by 6.89% (ADE) and 1.12% (FDE) in long-term prediction. Excellent performance is also shown on the ETH/UCY datasets.

1. Introduction

Pedestrian trajectory prediction (PTP) is a critical task in the fields of computer vision [1,2,3] and autonomous driving [4], marked by significant challenges. The core of this research extends beyond understanding the individual motion characteristics of pedestrians, also known as agents: it necessitates a thorough investigation into the complex interactions among agents as well as between agents and their environment. As illustrated in Figure 1, using the blue agent as an example, the interactions among agents and between agents and the environment are depicted. These interactions are not only multimodal but also highly dynamic, influenced by human self-awareness and social behavioral norms. Thus, predicting the motion trajectories of multiple agents requires modeling or considering two key factors: temporal aspects and social dynamics.
Temporally, the historical state of each agent impacts its future state. It is essential to model the temporal dependencies of each agent’s past state to predict their future trajectories. Socially, the states of surrounding agents and environmental information continuously influence the decision-making of a central agent. Therefore, modeling the interactions among agents and the complexities of the surrounding environment is necessary to capture intricate social dynamics.
Early models for crowd trajectory prediction primarily relied on manually designed energy functions to simulate interactions among individuals. This approach necessitates extensive feature engineering and is limited in capturing pedestrian interactions in busy scenes. However, the advancement of deep neural network technologies, particularly the application of Recurrent Neural Networks (RNNs) in pedestrian trajectory prediction, offers new perspectives to address this challenge. RNN-based models [1,2], by analyzing the hidden states of pedestrians, capture their motion trajectories and simulate interactions among individuals by integrating the hidden states of surrounding pedestrians. This approach has somewhat improved the understanding of social behavioral dynamics.
As a complement to RNN models, social-pooling [1,2] treats the states of neighboring pedestrians equitably and uses pooling techniques to integrate these states, thereby considering the spatial relationships between agents. In contrast, attention mechanisms [5,6], as advanced models, differentiate the importance of nearby pedestrians through learned weighting functions, providing a more flexible approach to trajectory prediction. Despite these advances, existing attention mechanisms still exhibit limitations in fully simulating the complex interactions among agents and between agents and their environment [7].
The traditional attention mechanism faces challenges in handling complex interactions among multiple pedestrians. In scenarios such as intersections, agents may exhibit various interaction modes, such as avoiding and merging. Traditional attention mechanisms struggle to adapt flexibly to these variations, resulting in a suboptimal performance in capturing relationships between multiple agents. Furthermore, the conventional single-head attention mechanism is inadequate in processing multi-scale inputs. In pedestrian trajectory prediction, agent movements may involve different spatio-temporal scales. The single-head structure of traditional mechanisms cannot simultaneously focus on and adapt to these varying scale features, thereby limiting the model’s capacity to handle multi-scale information.
On the other hand, in many traditional models, the environment is treated as a static backdrop or mere obstacles, without an in-depth consideration of its complex semantic information, such as road signs and traffic signals. This information is crucial for understanding and predicting pedestrian behavior patterns in specific environments. For instance, in urban street scenes, pedestrians’ choices of crossing locations are often influenced by the presence or absence of crosswalks. They tend to safely cross at locations with traffic signals, rather than riskily traversing streets with dense vehicle flow. Effectively integrating spatial information with pedestrian interaction data remains an open question.
In response to these challenges, this study introduces an innovative deep learning model M2Tames (multi-head mask temporal attention motion, interaction, semantic context) to improve the accuracy and adaptability of trajectory prediction. The new model encompasses not only the motion characteristics of agents but also the interactive features among agents as well as between agents and their environment. By employing an innovative multi-head mask temporal attention mechanism M2Tea, M2Tames is capable of simultaneously focusing on the agent’s motion characteristics and the interactive features between agents. Furthermore, I2 allows the model to capture complex social behavioral patterns among agents, such as avoidance, following, and congregating. A deep understanding of these interaction modes significantly supports prediction accuracy. Lastly, through the integration of the U-Net architecture, M2Tames learns and interprets scene semantic context, thereby deepening the understanding of spatial relationships between agents and their environment.
In this context, our research aims to further enhance the accuracy of pedestrian trajectory prediction. Compared to Agentformer [5], our proposed model, M2Tames, not only considers the agents’ historical trajectories but also incorporates semantic information about the environment. The introduced M2Tea module employs advanced multi-scale feature processing strategies to address the shortcomings of previous works that focused solely on interactions among agents or solely between agents and the environment. Moreover, an improved interaction module, I2, has been developed. Unlike previous models [1,2,6,7], this module emphasizes the importance of considering the movement patterns of nearby agents for accurately predicting pedestrian trajectories. This model is more effective in learning and adapting to complex interactions between agents and their environments, especially in densely populated scenarios.
The contributions of this paper are as follows:
  • This paper introduces a deep learning model, M2Tames, for pedestrian trajectory prediction. The model comprehensively considers the motion characteristics of agents as well as the interactions among agents and between agents and their environment, thereby improving the accuracy and adaptability of trajectory prediction;
  • A multi-head mask temporal attention mechanism, M2Tea, is designed to deeply characterize temporal and interactive features, extracting the dynamic interaction features of pedestrians to enhance the model's ability to learn and simulate real-world scenarios. Additionally, an Interaction Inference Module, I2, is designed to infer specific interaction relationships between agents from their input motion data;
  • Extensive experiments are conducted on the proposed M2Tames using the publicly available real-world datasets from SDD and ETH/UCY, and the results demonstrate that M2Tames achieves high accuracy in predicting pedestrian trajectories in both long-term and short-term settings.

2. Related Work

2.1. Interaction Modeling

Agent-to-agent interactions: In the field of trajectory prediction, the application of artificial neural networks leads to significant advancements. Social-LSTM [1], as a pioneering work, employs LSTM networks to model pedestrian dynamics and captures interactions among individuals through social pooling. SR-LSTM [3] further quantifies these interactions using a state refinement module’s weighted mechanism. Social-GAN [2] introduces generative adversarial networks, offering diverse possibilities for trajectory prediction tasks for the first time. Sophie [8] integrates attention mechanisms to merge scene and social information for trajectory prediction. Trajectron++ [9] proposes a modular, graph-structured recurrent model to predict the future trajectory distribution of multiple pedestrians. The Conditional Generative Neural System (CGNS) [10] combines the advantages of conditional latent space learning and variational divergence minimization to learn the future trajectory distribution space. PECNet [11], employing an endpoint-conditioned encoder, simulates human trajectories and generates socially compliant trajectories through social spatio-temporal graphs. Despite these advances, these models still face challenges in handling complex social interactions.
I2, proposed in this study and inspired by the work of [12], comprehensively considers agents’ historical trajectories and the motion patterns of nearby agents. It learns interaction types through multi-scale feature processing and effectively reduces the complexity and prediction error in pedestrian trajectory forecasting.
Agent-to-environment interactions: Early trajectory prediction research relies on physical models and statistical techniques, such as the social force model [13,14]. However, these models often neglect environmental semantic information, especially in complex urban settings. With the advancement of deep learning, particularly the application of Convolutional Neural Networks (CNNs), in image recognition and scene parsing, new trajectory prediction models begin to incorporate environmental semantics. For instance, Sophie [8] integrates attention mechanisms to merge scene and social information for trajectory prediction. Social Ways [15] uses generative adversarial networks to learn the distribution of pedestrian trajectories, emphasizing the generation of realistic trajectories while considering interactions among agents and between agents and the environment. SHENet [16] introduces a cross-modal interaction module for modeling the interactions between individual past trajectories and their surroundings. This study utilizes refined semantic segmentation through the modified U-Net architecture to learn spatio-temporal context information, aiding the model in better understanding interactions between agents and the environment.

2.2. Long-Term Trajectory Prediction

Recently, human motion prediction, particularly pedestrian movement, has garnered significant attention. Researchers categorize prediction problems into short-term (1–2 s), long-term (up to 20 s), and very long-term (over 20 s) forecasts. Various models have been developed for long-term prediction: Bera et al. [17] use Bayesian inference to learn global and local motion patterns for long-term trajectory prediction, and related work predicts vehicle trajectories from routing and GPS data [18]. Rudenko et al. [19] consider the semantic attributes of static environments for predicting long-term human movements, such as utilizing prior knowledge of environmental targets for planning-based predictions. Rudenko et al. [20] propose long-term global motion prediction models based on Markov Decision Processes (MDPs). Tran et al. [21] introduce a goal-driven long-term trajectory prediction model, simulating the process of determining pedestrian goals and their impact on future trajectories.
The motion prediction model in M2Tames utilizes the U-Net [22] architecture, originally developed for medical image segmentation. The U-Net encoder, through convolution and pooling operations, progressively extracts the scene semantics and complex features of the trajectory sequence. This process enables the model to develop a higher-level understanding of trajectories and scenes, encompassing a broad range of spatio-temporal context and global scene semantics. The design of U-Net allows the model to comprehend the global structure of pedestrian movement and long-term spatio-temporal relationships. Therefore, inspired by this, the U-Net architecture is adopted as the core framework of M2Tames.

2.3. Transformer

The transformer [23] relies entirely on a self-attention mechanism to capture global dependencies in serialized inputs. It has been applied in multiple domains, including natural language processing [24], visual tasks [7,25,26], and speech recognition [27]. Particularly in visual tasks, such as Vision Transformers (ViT) [28], images are serialized into a series of tokens, and the global dependencies of the image are modeled using a transformer encoder. Recently, some studies have applied transformers to pedestrian trajectory prediction. STAR [7] utilizes temporal and spatial transformers to extract temporal dependencies and spatial interactions, while AgentFormer [5] employs an agent-aware transformer to learn representations from both temporal and spatial dimensions. Differing from previous research, inspired by AgentFormer [5], this paper designs a multi-head mask temporal attention mechanism M2Tea to deeply characterize temporal and interactive features. This mechanism is tailored to extract the dynamic features of pedestrians, thereby enhancing the predictive performance. This approach focuses on learning temporal and interactive features in pedestrian trajectories.

3. Proposed Approach

Figure 2 presents the architecture of the model proposed in this paper, dedicated to predicting pedestrian trajectories in various social environments. The structure of the model is divided into three main parts: the U-Net encoder, the innovative interaction module, and the U-Net decoder. Initially, pedestrian trajectories are processed through the U-Net encoder, which extracts features. This process transforms the trajectory data into high-dimensional feature representations. These features are then fed into the interaction module, specifically designed to capture the interactions between pedestrians and the impact of environmental factors. After the interaction analysis, the U-Net decoder processes the outputs of the interaction module to reconstruct the predicted trajectories. This design enables the model not only to consider individual historical movement information but also to dynamically understand and adapt to the behaviors of the surrounding pedestrians and environmental changes, thereby enhancing the accuracy and applicability of trajectory prediction.
In this section, M2Tames is introduced, a pedestrian trajectory prediction model that fully considers the motion characteristics of the agent itself, the interactions between agents, and the interactions with the environment. As shown in Figure 3, the core of the model architecture lies in modules M2Tea and I2, which collaboratively learn the interaction features of agents.
The problem of multimodal trajectory prediction can be formalized as follows. Given an RGB scene $I_{rgb}$ and historical locations sampled at a fixed frame rate for past timesteps $t_p = 1, 2, \ldots, T_{obs}$, M2Tames aims to predict the pedestrian trajectories $Y$ for future timesteps $t_f = T_{obs}+1, \ldots, T_{pred}$, where $T_{pred}$ is the maximum predicted future timestep. The past trajectory of pedestrian $i$ ($1 \leq i \leq N$) in scene $I_{rgb}$ is defined as $X_i$, where $N$ is the total number of pedestrians:

$$X_i = \{ (x_i^{t_p}, y_i^{t_p}) \mid t_p = 1, 2, 3, \ldots, T_{obs} \}$$

The predicted and ground-truth trajectories of pedestrian $i$ for timesteps $t_f = T_{obs}+1, \ldots, T_{pred}$ are represented as $\hat{Y}_i$ and $Y_i$, respectively:

$$\hat{Y}_i = \{ (\hat{x}_i^{t_f}, \hat{y}_i^{t_f}) \mid t_f = T_{obs}+1, \ldots, T_{pred} \}$$

$$Y_i = \{ (x_i^{t_f}, y_i^{t_f}) \mid t_f = T_{obs}+1, \ldots, T_{pred} \}$$

3.1. Feature Initialization

The objective of the feature initialization layer $l$ is to obtain initial features, denoted as $M_i^l(t) \in \mathbb{R}^D$, where the feature of the $i$-th agent is composed of $D$ feature coordinates and there are $N$ agents in total. Given the past trajectory of the $i$-th agent as $X_i \in \mathbb{R}^{2 \times T_{obs}}$, the initial feature of the $i$-th agent at layer $l+1$ is defined as follows:

$$M_i^{l+1}(t) = \Phi^l\big([\,\mathrm{Dist}_i(t);\, V_i(t);\, \mathrm{Acc}_i(t);\, \theta_i(t)\,]\big)$$

$$\mathrm{Dist}_i(t) = X_i(t) - \bar{X}$$

$$V_i(t) = \big\| \big(X_i(t) - X_i(t - \Delta t)\big) / \Delta t \big\|_2$$

$$\mathrm{Acc}_i(t) = \big\| \big(V_i(t) - V_i(t - \Delta t)\big) / \Delta t \big\|_2$$

$$\theta_i(t) = \mathrm{atan2}\big(V_i^x(t), V_i^y(t)\big)$$

where $\Phi^l$ is a non-linear transformation function, $[\,\cdot\,;\,\cdot\,;\,\cdot\,;\,\cdot\,]$ is a concatenation operation, and $\bar{X}$ is the average position of all agents across all past timesteps. $\mathrm{Dist}_i(t)$, $V_i(t)$, $V_i^x(t)$, $V_i^y(t)$, and $\mathrm{Acc}_i(t)$ represent the relative position, velocity, velocity along the $x$ coordinate, velocity along the $y$ coordinate, and acceleration of agent $i$ at time instant $t$, respectively. $\Delta t$ is the time interval, $\|\cdot\|_2$ denotes the Euclidean norm (2-norm) of a vector, $\theta$ represents the current direction of an agent's motion, and $\mathrm{atan2}(\cdot,\cdot)$ is the two-argument arctangent function that returns the angle for the given coordinates.
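To make the feature construction concrete, the following is a minimal PyTorch sketch of this initialization step. The tensor layout, the handling of the first timestep, and the use of per-axis velocities for the heading are illustrative assumptions rather than the authors' exact implementation.

```python
import torch

def init_agent_features(X, dt=0.4):
    """Sketch of the feature initialization (Dist, V, Acc, theta).

    X: (N, T_obs, 2) past positions of N agents; dt is the sampling interval.
    """
    mean_pos = X.mean(dim=(0, 1), keepdim=True)            # average position over all agents/timesteps
    dist = X - mean_pos                                     # relative position Dist_i(t)
    vel_vec = torch.zeros_like(X)
    vel_vec[:, 1:] = (X[:, 1:] - X[:, :-1]) / dt            # per-axis velocity (V^x, V^y)
    speed = vel_vec.norm(dim=-1, keepdim=True)              # V_i(t): Euclidean norm of the velocity
    acc = torch.zeros_like(speed)
    acc[:, 1:] = (speed[:, 1:] - speed[:, :-1]).abs() / dt  # Acc_i(t)
    theta = torch.atan2(vel_vec[..., 0], vel_vec[..., 1]).unsqueeze(-1)  # heading atan2(V^x, V^y)
    feats = torch.cat([dist, speed, acc, theta], dim=-1)    # concatenation [Dist; V; Acc; theta]
    return feats                                            # fed to the nonlinear transform Phi^l

# Example: 5 agents observed for 8 timesteps
feats = init_agent_features(torch.randn(5, 8, 2))
print(feats.shape)  # torch.Size([5, 8, 5])
```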

3.2. M2Tea: Multi-Head Mask and Temporal Attention

M2Tea is introduced, incorporating time stamping to integrate temporal information and employing a masking mechanism to differentiate between an agent and its surrounding agents. This approach allows the attention mechanism to distinguish agent identities during feature learning and to simultaneously focus on different parts of the input data, thereby capturing the diversity within the data. In pedestrian trajectory prediction, the proposed multi-head mask temporal attention mechanism comprehensively considers various factors affecting pedestrian movement, enhancing the accuracy of future trajectory prediction. The innovative aspects of the module are detailed below.

3.2.1. Time Encoder

In time series problems such as trajectory prediction, traditional attention mechanisms with positional information alone are insufficient to reflect the temporal relationships between elements. Hence, positional encoding with timestamps [29] is introduced. The time encoder embeds time-related codes into each input of M2Tea, enabling M2Tea to differentiate between past and future time points and to account for temporal ordering when processing sequential data. The time encoder is defined as follows:
$$pt(t, 2n) = \sin\!\left( \frac{t}{10000^{\,2n / d_{model}}} \right)$$

$$pt(t, 2n+1) = \cos\!\left( \frac{t}{10000^{\,(2n+1) / d_{model}}} \right)$$
where $pt(t, 2n)$ and $pt(t, 2n+1)$ represent the even- and odd-dimensional encodings of the position at time $t$, respectively, the dimension of $pt$ matches the input dimension of the node embedding vector defined by $d_{model}$, and $n$ is the position within the embedding vector.
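A minimal sketch of this sinusoidal time encoding is given below. It reproduces the exponent convention of the equations above (including the $(2n+1)/d_{model}$ exponent for the cosine term); the exact normalization used by the authors is an assumption.

```python
import torch

def time_encoding(T, d_model):
    """Timestamp-based positional encoding pt(t, 2n) / pt(t, 2n+1)."""
    pe = torch.zeros(T, d_model)
    t = torch.arange(T, dtype=torch.float32).unsqueeze(1)          # timestamps t = 0..T-1
    n = torch.arange(0, d_model, 2, dtype=torch.float32)           # even dimensions 2n
    pe[:, 0::2] = torch.sin(t / (10000.0 ** (n / d_model)))        # pt(t, 2n)
    pe[:, 1::2] = torch.cos(t / (10000.0 ** ((n + 1) / d_model)))  # pt(t, 2n+1)
    return pe

pe = time_encoding(T=8, d_model=64)   # added to the node embedding of each timestep
print(pe.shape)  # torch.Size([8, 64])
```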

3.2.2. Multi-Head Mask Attention

Previous models [5,26] on pedestrian trajectory prediction using transformer models often overlooked the issue of losing the social identity information of agents and their surroundings. To address this problem, M2Tea is employed, the principle of which is explained in Figure 4, where the mask is assigned 0 or 1. If the $i$-th query $q_i$ and the $j$-th key $k_j$ belong to the same type of agent, the mask equals 1; otherwise, it is 0. This differentiation allows M2Tea to focus on an individual agent’s motion characteristics while analyzing interactions with other agents or the environment. The mask mechanism ensures that M2Tea can accurately capture and learn the agent-owned behavioral patterns and trajectory characteristics while understanding how agents interact within a group. Such an approach enhances the comprehensiveness and accuracy of trajectory prediction.
The multi-head mechanism constructs a model with multiple attention heads, and each attention head can learn different spatio-temporal relationships and feature representations. The weighted features calculated by each attention head are fused to form a richer and more diverse overall feature representation. To address complex interactions between multiple agents, given a simplified scenario with agents A and B, their interaction can be influenced by various factors, such as their current positions, relative velocity vectors, and goals.
However, traditional single-head attention mechanisms cannot effectively capture these diverse interaction modes. In the multi-head attention mechanism, multiple heads are created, each focusing on a different aspect of the interaction information. One head can concentrate on capturing agent A’s potential destination and movement direction, while another can focus on agent B’s velocity and position. This multi-head structure allows M2Tames to simultaneously learn features from these different aspects, so that it can understand the complex interactions between A and B more comprehensively.
Introducing timestamps and a masking mechanism enables M2Tames to clearly distinguish between the identity and temporal information of multiple agents during the learning process. The advantage of the multi-head mechanism lies in its ability to simultaneously process information at different scales [23], making M2Tames more flexibly adaptable to the diverse spatio-temporal scales in pedestrian trajectory prediction. This design enhances the adaptability of M2Tames, allowing it to capture the complete dynamics of agent movement more effectively. The specific computational formulas are defined as follows:
$$A_{ij}^h(t) = \mathrm{softmax}\!\left( \frac{Q_i^h(t) \cdot \big(K_j^h(t)\big)^{T}}{\sqrt{d_k}} \right)$$

$$\Lambda_{ij}^h(t) = A_{ij}^h(t) \cdot \mathrm{Mask}_{ij}(t)$$

$$O_{M2Tea,\,i}(t) = \sum_{h=1}^{H} \sum_{j=1}^{N} \Lambda_{ij}^h(t) \cdot V_j^h(t)$$

where $Q_i^h(t)$ and $K_j^h(t)$ represent the projections of the $i$-th query and $j$-th key in the $h$-th attention head, respectively, and $\sqrt{d_k}$ is the scaling factor that ensures the stability of the attention mechanism. $\Lambda_{ij}^h(t)$ is the attention weight after applying the temporal mask, $H$ is the number of attention heads, and $V_j^h(t)$ is the projection of the $j$-th value in the $h$-th head.
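The sketch below illustrates this masked multi-head computation in PyTorch for the agents observed at one timestep. Summing the head outputs mirrors the formula above; practical multi-head attention often concatenates and re-projects the heads instead, so the output combination and the random agent-type mask are assumptions for illustration only.

```python
import torch
import torch.nn.functional as F

def m2tea_attention(Q, K, V, mask):
    """Masked multi-head attention: A, Lambda, and the per-agent output O.

    Q, K, V: (H, N, d_k) per-head projections for N agents;
    mask: (N, N) with 1 where query i and key j share the same agent type, else 0.
    """
    d_k = Q.size(-1)
    A = F.softmax(Q @ K.transpose(-2, -1) / d_k ** 0.5, dim=-1)   # (H, N, N) attention weights
    Lam = A * mask.unsqueeze(0)                                   # apply the identity mask
    O = (Lam @ V).sum(dim=0)                                      # sum over heads -> (N, d_k)
    return O

H, N, d_k = 4, 6, 16
types = torch.randint(0, 2, (N,))                                 # hypothetical agent-type labels
mask = (types.unsqueeze(0) == types.unsqueeze(1)).float()
out = m2tea_attention(torch.randn(H, N, d_k), torch.randn(H, N, d_k),
                      torch.randn(H, N, d_k), mask)
print(out.shape)  # torch.Size([6, 16])
```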

3.3. I2: Interactive Inference

In complex pedestrian trajectory prediction scenarios, the interaction relationships between agents are typically implicit and challenging to grasp directly. To effectively address this issue, an innovative Interaction Inference Module is proposed. The core objective of I2 is to infer the potential interaction categories between agents, and the various categories of social behavior patterns include avoidance, leading, following, grouping, fishing, schooling, etc.
I2 is designed to infer the interaction relationships between agents $i$ and $j$ ($i, j \in [1, N]$, $i \neq j$), generating interaction weights $C_{ij} \in [0, 1]^K$, where $K$ is the number of interaction categories. I2 takes the relative position, velocity, acceleration, distance, and direction of agents, computed from their past trajectories, as the input of $\Phi_c$, a multilayer perceptron with a softmax activation function. The specific formulas are defined as follows:
$$m_{ij}^l = \Phi_m^l\big(M_i^l, M_j^l\big)$$

$$P_i^l = \sum_{j \in N(i)} m_{ij}^l$$

$$\tilde{M}_i^l = \Phi_M^l\big(P_i^l, M_i^l\big)$$

$$C_{ij}^l = \mathrm{softmax}\!\left( \Phi_c^l\big(\tilde{M}_i^l, \tilde{M}_j^l\big) / \tau \right)$$
where $m_{ij}^l$ represents the influence of agent $i$ on its neighbor $j$. $\Phi_m^l$ is a neural network that processes the features of two agents and includes a nonlinear activation function to capture complex interaction patterns between agents. $P_i^l$ is the aggregation of information from neighboring agents, and $N(i)$ is the neighbor set of the $i$-th agent. $\Phi_M^l$, implemented as a neural network, transforms $P_i^l$ and the initial agent feature $M_i^l$ into the updated feature $\tilde{M}_i^l$, from which the aggregation weights are calculated; another MLP, $\Phi_c^l$, transforms these features into the interaction weights $C_{ij}$. The softmax function converts $\Phi_c^l(\tilde{M}_i^l, \tilde{M}_j^l)$ into a probability distribution, ensuring that the output interaction weights are normalized, and $\tau$ is the temperature parameter that controls the smoothness of the softmax distribution.
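The following PyTorch sketch shows one plausible realization of I2 for a single layer: pairwise features pass through an edge MLP, neighbor messages are aggregated, node features are updated, and a temperature-scaled softmax yields the K category weights. Layer widths, the specific MLP depths, and the inclusion of self-pairs (which a full implementation would mask out) are assumptions, not the authors' exact configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class InteractionInference(nn.Module):
    """Sketch of I2: infer K interaction-category weights C_ij from pairwise features."""

    def __init__(self, d, K, tau=0.5):
        super().__init__()
        self.phi_m = nn.Sequential(nn.Linear(2 * d, d), nn.ReLU())   # edge feature m_ij
        self.phi_M = nn.Sequential(nn.Linear(2 * d, d), nn.ReLU())   # node update from aggregated neighbors
        self.phi_c = nn.Sequential(nn.Linear(2 * d, d), nn.ReLU(), nn.Linear(d, K))
        self.tau = tau

    def forward(self, M):
        # M: (N, d) per-agent motion features
        N = M.size(0)
        pair = torch.cat([M.unsqueeze(1).expand(N, N, -1),
                          M.unsqueeze(0).expand(N, N, -1)], dim=-1)  # (N, N, 2d) pairs (i, j)
        m = self.phi_m(pair)                                         # m_ij
        P = m.sum(dim=1)                                             # aggregate neighbor messages: P_i
        M_upd = self.phi_M(torch.cat([P, M], dim=-1))                # updated node feature
        pair_upd = torch.cat([M_upd.unsqueeze(1).expand(N, N, -1),
                              M_upd.unsqueeze(0).expand(N, N, -1)], dim=-1)
        C = F.softmax(self.phi_c(pair_upd) / self.tau, dim=-1)       # (N, N, K) category weights
        return C, M_upd

C, M_upd = InteractionInference(d=32, K=4)(torch.randn(6, 32))
print(C.shape)  # torch.Size([6, 6, 4])
```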

3.4. Aggregation

Aggregation operations learn aggregation weights from the interaction categories obtained through I2. These weights, used to aggregate the interaction features of neighboring agents, enable M2Tames to generate more accurate interaction features, which are crucial for predicting the future movement of agents. $\Phi_{e,k}^l$ is used to model the $k$-th mode of interaction. The specific formulas are defined as follows:
$$e_{ij}^l = \sum_{k=1}^{K} C_{ij,k}\, \Phi_{e,k}^l\big(M_i^l, M_j^l\big)$$

$$M_i^{l+1}(t) = M_i^l(t) + \sum_{j \in N(i)} e_{ij}^l \cdot \big(M_j^l(t) - M_i^l(t)\big)$$
where $e_{ij}^l \in \mathbb{R}^D$ represents the aggregation weight between agents $i$ and $j$, and $\Phi_{e,k}^l$ is an MLP designated for the $k$-th interaction category, tasked with learning the complexity of the interaction between agents $i$ and $j$. These weights, used to aggregate the motion features of neighboring agents, effectively capture the dynamics of their interactions.
Through this approach, the interaction weights and categories are integrated to update motion features. This aggregation process not only considers the individual characteristics of agents but also encompasses their relative relationships and the varied impacts of different types of interactions on the update of agent features.
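A compact sketch of this aggregation step is shown below: the category weights from I2 blend K per-category edge MLPs into the weight $e_{ij}$, which then drives a residual-style update of each agent's feature from its neighbors. The `edge_mlps` container and all shapes are illustrative assumptions.

```python
import torch
import torch.nn as nn

def aggregate(M, C, edge_mlps):
    """Update agent features with category-weighted neighbor aggregation.

    M: (N, d) agent features; C: (N, N, K) interaction-category weights from I2;
    edge_mlps: K per-category MLPs playing the role of Phi_e,k.
    """
    N, d = M.shape
    pair = torch.cat([M.unsqueeze(1).expand(N, N, -1),
                      M.unsqueeze(0).expand(N, N, -1)], dim=-1)              # (N, N, 2d)
    # e_ij = sum_k C_ij,k * Phi_e,k(M_i, M_j)
    e = sum(C[..., k:k + 1] * mlp(pair) for k, mlp in enumerate(edge_mlps))  # (N, N, d)
    diff = M.unsqueeze(0) - M.unsqueeze(1)                                   # M_j - M_i
    return M + (e * diff).sum(dim=1)                                         # updated features (N, d)

d, K, N = 32, 4, 6
edge_mlps = nn.ModuleList(nn.Sequential(nn.Linear(2 * d, d), nn.ReLU()) for _ in range(K))
M_new = aggregate(torch.randn(N, d), torch.softmax(torch.randn(N, N, K), dim=-1), edge_mlps)
print(M_new.shape)  # torch.Size([6, 32])
```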

3.5. Goal and Trajectory Prediction Module

The feature map $H_I$ from the last layer of the U-Net encoder is concatenated along the channel dimension with the feature distribution map $O_{Agg}$ that encodes the past motion history. After concatenation, M2Tames obtains a fused tensor $H_S$ of dimensions $W \times H \times (C + C_{O_{Agg}})$, where $C$ and $C_{O_{Agg}}$ are the channel counts of the encoder feature map and $O_{Agg}$, respectively. This tensor is then fed into a U-Net decoder composed of $L$ blocks, each of which processes the fused features from $H^l$ and its up-sampled counterpart. These decoding blocks employ bilinear interpolation to up-sample the feature maps and enhance their resolution through two consecutive convolution layers followed by ReLU activation functions. The decoder uses skip connections to merge the feature maps $H^l$ output by the encoder, enriching the detail in the features. Finally, the decoder outputs a Spatial Probability Distribution Map (SPDM) through an output convolution layer followed by a pixel-level sigmoid activation function. The SPDM represents the probability of the observed agents being at their final positions at $T_{pred}$, given the history information from $t = 0$ to $T_{obs}$.
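The sketch below illustrates one such decoder block and the channel-wise fusion described above: bilinear up-sampling, a skip connection, two Conv+ReLU layers, and a final 1x1 convolution with a pixel-wise sigmoid producing the SPDM. Channel counts and spatial sizes are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DecoderBlock(nn.Module):
    """One U-Net decoder block: bilinear up-sampling, skip connection, two Conv+ReLU."""

    def __init__(self, in_ch, skip_ch, out_ch):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(in_ch + skip_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True))

    def forward(self, x, skip):
        x = F.interpolate(x, size=skip.shape[-2:], mode="bilinear", align_corners=False)
        return self.conv(torch.cat([x, skip], dim=1))     # merge with the encoder feature map

# Fuse the interaction map with the deepest encoder features, decode, and emit the SPDM.
h_enc = torch.randn(1, 64, 16, 16)                        # H_I from the last encoder layer
o_agg = torch.randn(1, 32, 16, 16)                        # interactive feature map O_Agg
h_skip = torch.randn(1, 32, 32, 32)                       # an encoder skip feature map
fused = torch.cat([h_enc, o_agg], dim=1)                  # W x H x (C + C_OAgg)
block = DecoderBlock(in_ch=96, skip_ch=32, out_ch=32)
spdm = torch.sigmoid(nn.Conv2d(32, 1, 1)(block(fused, h_skip)))
print(spdm.shape)  # torch.Size([1, 1, 32, 32])
```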

3.6. Loss Function

This study designs a loss function inspired by Y-Net [30]. Specifically, M2Tames is designed as an end-to-end joint training method, which comprehensively considers two core tasks: predicting the distributions of agents’ trajectories and their goals. By introducing weight factors, a weighted combination strategy of binary cross-entropy loss can balance the importance of different tasks. The design of this loss function enables M2Tames to effectively balance the contributions of multiple tasks during the learning process and to more comprehensively understand the complexities of pedestrian trajectory prediction. By integrating considerations for both the predictive goal and trajectory distribution, M2Tames can flexibly adapt to various task requirements, providing robust support for enhancing the accuracy of future pedestrian trajectory predictions. The end-to-end joint training and multi-task learning strategy offer significant advantages for M2Tames’s performance in long-term pedestrian trajectory prediction tasks.
The goal loss, denoted as $L_{goal}$, is measured using binary cross-entropy (BCE) to quantify the difference between the predicted goal point distribution $\hat{P}(Y_i^{T_{pred}})$ and the actual goal point distribution $P(Y_i^{T_{pred}})$. The trajectory loss $L_{traj}$ is the binary cross-entropy loss between the predicted distribution and the actual distribution for all points from $t = T_{obs}+1$ to $t = T_{pred}$. The total loss $L$ is the weighted sum of these two losses, and the importance of the trajectory loss within the total loss is adjusted by the weight parameter $\lambda$. This paper uses the Adam optimization method to dynamically optimize the value of $\lambda$, achieving a more precise control of the loss weights. The specific formulas are defined as follows:
$$L_{goal} = \frac{1}{N} \sum_{i=1}^{N} \mathrm{BCE}\!\left( P\big(Y_i^{T_{pred}}\big),\, \hat{P}\big(Y_i^{T_{pred}}\big) \right)$$

$$L_{traj} = \frac{\sum_{i=1}^{N} \sum_{t=T_{obs}+1}^{T_{pred}} \mathrm{BCE}\!\left( P\big(Y_i^{t}\big),\, \hat{P}\big(Y_i^{t}\big) \right)}{N \times (T_{pred} - T_{obs})}$$

$$L = L_{goal} + \lambda L_{traj}$$

where $N$ is the number of agents, $P$ is the true distribution, $\hat{P}$ is the distribution predicted by the model, $Y_i^{T_{pred}}$ denotes the true goal point, and $\hat{Y}_i^{T_{pred}}$ denotes the predicted goal point.
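A minimal sketch of this joint loss is given below, treating the predictions and targets as per-timestep spatial probability maps (e.g., Gaussian heatmaps around the ground-truth positions); the tensor layout and the use of the last timestep as the goal map are assumptions.

```python
import torch
import torch.nn.functional as F

def m2tames_loss(P_pred, P_true, lam=1.0):
    """Joint loss: BCE on the goal (last-timestep) map plus lambda-weighted BCE over all timesteps.

    P_pred, P_true: (N, T_f, H, W) spatial probability maps in [0, 1].
    """
    goal_loss = F.binary_cross_entropy(P_pred[:, -1], P_true[:, -1])   # L_goal at T_pred
    traj_loss = F.binary_cross_entropy(P_pred, P_true)                 # L_traj averaged over all t
    return goal_loss + lam * traj_loss

P_true = torch.rand(2, 12, 32, 32)          # ground-truth distributions (illustrative)
P_pred = torch.rand(2, 12, 32, 32)          # model output after the sigmoid
print(m2tames_loss(P_pred, P_true, lam=1.0))
```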

4. Results

4.1. Dataset and Experiments

4.1.1. Dataset

The Stanford Drone Dataset (SDD) [31] is widely utilized in computer vision and machine learning, focusing on pedestrian trajectory prediction in urban environments. The SDD has become an essential tool for pedestrian trajectory prediction research, especially in high-density and dynamically changing urban scenarios. Captured by drones from Stanford University at various campus locations, the dataset spans a variety of urban settings, from academic areas to residential dining zones and major roadways.
The SDD dataset includes 1136 trajectories of students in campus scenes, sampled at 30 FPS. For short-term predictions, following [32], the data are subsampled to achieve the recommended frame rate of 2.5 FPS. The model observes for 3.2 s and predicts the subsequent 4.8 s. The midpoint of the original bounding boxes is used to achieve a coordinate representation that is consistent with the preprocessed short-term settings.
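As a concrete illustration of this subsampling scheme, the following NumPy sketch builds the observed and predicted windows from a hypothetical 30 FPS track of bounding-box midpoints (30 FPS to 2.5 FPS means keeping every 12th frame, so 3.2 s corresponds to 8 frames and 4.8 s to 12 frames); the array contents are placeholders.

```python
import numpy as np

track = np.random.randn(600, 2)             # hypothetical 30 FPS trajectory of box midpoints
sub = track[::12]                           # 30 FPS -> 2.5 FPS (every 12th frame)
obs, fut = sub[:8], sub[8:20]               # 3.2 s observed, 4.8 s to predict
print(obs.shape, fut.shape)                 # (8, 2) (12, 2)
```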
The ETH/UCY dataset is widely used in computer vision and machine learning; it includes pedestrian trajectory data collected at ETH Zurich and the University of Cyprus (UCY). The dataset encompasses five different public scenarios: a busy city center (ETH scene), a spacious hotel environment (HOTEL scene), a university campus (UNIV scene), and two shopping areas (ZARA1 and ZARA2 scenes). Each scenario provides multiple trajectories of pedestrians moving under natural conditions, manually annotated through video analysis. The diversity of these scenarios and the complexity of pedestrian interactions make the ETH/UCY dataset an ideal choice for evaluating and testing pedestrian trajectory prediction models, particularly in analyzing the impact of social interactions and environmental factors on pedestrian behavior.

4.1.2. Metrics

Two error metrics are employed to evaluate the effectiveness of M2Tames: Average Displacement Error (ADE) and Final Displacement Error (FDE). ADE measures the mean Euclidean distance between all corresponding positions in the predicted trajectory and the ground truth trajectory, reflecting the overall alignment of the predicted trajectory with the actual trajectory. The specific formula for ADE is defined as follows:
$$ADE = \frac{\sum_{i=1}^{N} \sum_{t=T_{obs}+1}^{T_{pred}} \left\| \big(x_t^i, y_t^i\big) - \big(\tilde{x}_t^i, \tilde{y}_t^i\big) \right\|_2}{N \times (T_{pred} - T_{obs})}$$
Final Displacement Error (FDE) considers the Euclidean distance between the endpoint of the predicted and the actual trajectory, reflecting the accuracy of the endpoint position prediction. The specific formula is as follows:
$$FDE = \frac{1}{N} \sum_{i=1}^{N} \left\| \big(x_{T_{pred}}^i, y_{T_{pred}}^i\big) - \big(\tilde{x}_{T_{pred}}^i, \tilde{y}_{T_{pred}}^i\big) \right\|_2$$
Given that experiments take place in a stochastic environment, ADE and FDE are reported based on the minimum error values among 20 prediction samples. This approach generates 20 prediction samples for each input trajectory and selects the sample with the least error. This method provides a more comprehensive assessment of the model’s predictive accuracy under varied conditions.
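The best-of-20 evaluation can be implemented as in the NumPy sketch below: per-point errors are computed for every sample, and ADE/FDE are taken over the sample with the smallest error for each pedestrian. Shapes are illustrative assumptions.

```python
import numpy as np

def min_ade_fde(preds, gt):
    """Best-of-K ADE/FDE.

    preds: (K, N, T_f, 2) sampled trajectories; gt: (N, T_f, 2) ground truth.
    """
    dists = np.linalg.norm(preds - gt[None], axis=-1)        # (K, N, T_f) per-point errors
    ade = dists.mean(axis=-1).min(axis=0).mean()             # best-of-K ADE, averaged over agents
    fde = dists[..., -1].min(axis=0).mean()                  # best-of-K FDE at the final timestep
    return ade, fde

preds = np.random.randn(20, 3, 12, 2)   # K = 20 samples, 3 pedestrians, 12 future steps
gt = np.random.randn(3, 12, 2)
print(min_ade_fde(preds, gt))
```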

4.1.3. Data Preprocessing

To adapt to model requirements, to enhance algorithm efficiency, and to address computational resource limitations when processing large datasets, preprocessing is conducted on RGB scene images I r g b and trajectory heatmaps H m a p . For the datasets, a quarter of the samples are randomly selected from the original data for both I r g b and H m a p . All images and trajectories are standardized using a division factor and resized with a scale factor of 0.2, along with perspective transformation processing.
Various data augmentation techniques are employed to increase the diversity of the training data. First, all scene images and trajectories are rotated by 90°, 180°, and 270° and flipped horizontally and vertically. Second, a Gaussian convolution kernel, with a kernel size of 31 and a standard deviation of 4, is used for image smoothing, adjusting the contrast between pixels and changing the color distribution and overall brightness of the images.
Through these methods, the quantity of training data becomes eightfold to provide large training samples to enhance M2Tames’s performance and generalization capability. Such data augmentation strategies not only alleviate overfitting issues but also enable M2Tames to better adapt to various scenes and trajectory conditions.
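The sketch below illustrates such an augmentation pipeline with OpenCV and NumPy for the scene images only (the corresponding trajectory coordinates would be transformed consistently); the exact pairing of rotations and flips that yields the eightfold increase is an assumption.

```python
import cv2
import numpy as np

def augment_scene(img, sigma=4, ksize=31):
    """Rotations (0/90/180/270 degrees) plus a horizontal flip of each, then Gaussian smoothing."""
    variants = []
    for k in range(4):                                    # 0, 90, 180, 270 degree rotations
        rot = np.rot90(img, k)
        variants.append(rot)
        variants.append(np.fliplr(rot))                   # flipped counterpart of each rotation
    # smooth with a 31x31 Gaussian kernel, standard deviation 4
    return [cv2.GaussianBlur(np.ascontiguousarray(v), (ksize, ksize), sigma) for v in variants]

aug = augment_scene(np.random.randint(0, 255, (64, 64, 3), dtype=np.uint8))
print(len(aug))  # 8 variants per scene image
```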

4.1.4. Experimental Parameters

In all experiments, during the attention calculation phase, the dropout rate is set to 0.1, and the number of heads for the multi-head agent-aware attention is 4. This setup allows trajectory coordinates to be mapped into different feature spaces to study the mutual influences between trajectories. In the interaction and aggregation learning phases, the number of feature learning layers is set to four; the input of each layer is the output of the previous layer, iterating four times.
In the training stage, M2Tames uses the Adam optimizer for gradient descent on the loss function, with a weight decay coefficient of $1 \times 10^{-3}$. Training runs for 300 iterations with a batch size of 32. All multi-layer perceptrons (MLPs) in M2Tames have two layers, each equipped with the ReLU activation function. M2Tames, built with Python 3.8 and PyTorch 1.7.1, is trained on a single NVIDIA RTX 4090 GPU.
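A minimal sketch of this training configuration is shown below; the model is a placeholder, the MSE loss stands in for the joint goal/trajectory loss, and the learning rate is not stated in the text, so its value here is an assumption.

```python
import torch
import torch.nn as nn

# Adam with weight decay 1e-3, 300 training iterations, batch size 32.
model = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 2))   # placeholder network
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4, weight_decay=1e-3)

for step in range(300):
    x, y = torch.randn(32, 16), torch.randn(32, 2)        # one batch of 32 samples
    loss = nn.functional.mse_loss(model(x), y)            # stand-in for the joint loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```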

4.2. Quantitative Results

Table 1 presents the results for the Stanford Drone Dataset (SDD) in the short-term setting, specifically with $t_p = 3.2$ s and $t_f = 4.8$ s. Following the established protocol delineated in Trajnet [32], leave-one-out cross-validation is employed. With 20 prediction samples, M2Tames achieves an Average Displacement Error (ADE) of 8.02 and a Final Displacement Error (FDE) of 12.31, surpassing all competing models. Notably, it outperforms the latest MSRL [6] by margins of 2.49% for ADE and 8.77% for FDE. The model produces $K_a$ separate trajectory predictions for each given goal. In short time horizons, because the total path is short, the paths to a given goal are similar to each other; thus, $K_a$ is set to 1.
In short-term pedestrian trajectory prediction, M2Tames demonstrates significant advantages through its innovative combination of M2Tea, I2, and the U-Net architectures. This integration enables the model to effectively capture and leverage pedestrians’ historical trajectory, particularly in applications involving complex interactions and dynamic environments. Both the ADE and FDE metrics demonstrate that the model achieves lower errors in predicting short-term pedestrian trajectories. This underscores its advanced capability in understanding pedestrian behavior dynamics, as well as their interactions with the environment and other pedestrians. Consequently, M2Tames exhibits strong practicality and broad application potential in handling short-term prediction tasks.
Table 2 presents the results for the short-term setup on the ETH/UCY datasets. Employing the standard protocol from Trajnet [32], which includes leave-one-out cross-validation and uses 20 prediction samples, M2Tames achieves an Average Displacement Error (ADE) of 0.18 and a Final Displacement Error (FDE) of 0.28. These metrics surpass those of all competing models, including the recent MSRL.
An analysis of experimental results indicates that the model exhibits exceptional performance on the Hotel, Zara1, and Zara2 datasets. An in-depth examination of these datasets reveals that the model performs particularly well in scenarios involving dense crowds. This superior performance is attributed to the innovative modules introduced, which enable a more precise depiction of interactions among agents, thereby enhancing the model’s predictive accuracy.
M2Tames not only achieves technical innovation but also demonstrates significant effectiveness in practical applications. By integrating advanced attention mechanisms, the model not only excels on a single dataset but also maintains a high level of predictive accuracy across various environments and dynamic conditions. Moreover, the model’s consistent performance across different scenarios provides strong evidence of its scalability and adaptability for real-world applications. These results offer a new technical pathway for pedestrian trajectory prediction and hold important implications and application value for research in intelligent transportation systems, urban planning, and security monitoring.
Table 3 presents the results for the Stanford Drone Dataset (SDD) in the long-term setting, predicting future trajectories for $t_f = 30$ s based on a past motion history of $t_p = 5$ s. All reported errors are in pixels, with lower values indicating better performance. M2Tames achieves an Average Displacement Error (ADE) of 44.85 and a Final Displacement Error (FDE) of 65.97, surpassing all other models and outperforming the best-performing Y-Net [30] by 6.89% (ADE) and 1.12% (FDE). Y-Net introduced a long-term trajectory prediction task with a prediction horizon of up to 30 s. For benchmarking purposes, the results of Y-Net [30] are obtained by running the published source code, while the results for the other models come from [30].
For long-term pedestrian trajectory prediction, M2Tames demonstrates its exceptional performance in extended sequence forecasting by adjusting the $K_a$ parameter. As the $K_a$ value increases, the model not only maintains high accuracy but also significantly reduces the ADE and FDE metrics, even in the most challenging long-term prediction scenarios. This reflects the model’s effective handling of long-term dependencies. Such performance advantages are particularly suited for applications requiring highly accurate long-term predictions, such as intelligent transportation systems and urban monitoring. This underscores the practical value of M2Tames in urban planning and security monitoring systems.
M2Tames, through its innovative multi-head mask temporal attention mechanism and U-Net architecture, demonstrates superior performance in both short-term and long-term pedestrian trajectory prediction tasks. These performance metrics validate the model’s effectiveness in understanding pedestrian dynamics and interactions in complex scenarios. Whether in short-term or long-term prediction tasks, M2Tames provides robust data support and decision-making foundations for applications.
Further analysis is conducted on the impact of different spatial layers L within the multi-layer spatial module on the model’s performance. As shown in Figure 5, there is a noticeable downward trend in both Average Displacement Error (ADE) and Final Displacement Error (FDE) with the increase in the number of spatial layers. However, as L increases, a significant amount of time and space is required for training and testing. Observing the trend, it is evident that the rate of error reduction diminishes. The benefits gained from adding more layers do not justify the additional time and space required. Therefore, this study limits the number of layers to four.
The generalizability of the model discussed in this paper was also tested based on the number of samples $K_a$ used during testing. As depicted in Figure 6, both the Average Displacement Error (ADE) and the Final Displacement Error (FDE) decrease as $K_a$ increases. Moreover, for the same number of samples, the model presented in this paper outperforms the MSRL model, indicating that our model achieves the same level of error with fewer samples. These results demonstrate that M2Tames captures the potential interactions among agents, as well as between agents and the environment, more accurately, reducing the variance of the predicted trajectory distributions. This leads to enhanced accuracy and better generalization of the predictions.

4.3. Ablation Study

Table 4 illustrates the results of the ablation study on the M2Tames structure. The components, $M^2$ for the multi-head mask temporal attention module, $I^2$ for the Interaction Inference Module, and $U$ for the U-Net architecture, are evaluated to demonstrate their individual and combined impacts on the Average Displacement Error (ADE) and Final Displacement Error (FDE). The outcomes indicate that integrating the agent’s motion characteristics, the interactions among agents, and the interactions with the environment produces the best results, and removing any component from M2Tames leads to a decrease in performance. These results validate the contribution of each module and thus the effectiveness of the proposed model M2Tames.
From the sustained performance improvements introduced by the M2Tea and I2 components, it is evident that the model focuses on deepening the understanding of interactions among agents. The model has targeted and innovative enhancements in this area. Furthermore, the model integrates semantic information from the environmental context, a distinctive combination that enables a more comprehensive analysis and prediction of agents’ future trajectories. This approach, which holistically considers agent interactions and environmental factors, not only enhances the accuracy of trajectory prediction but also boosts the model’s applicability and robustness in complex scenarios.

4.4. Qualitative Results

The predicted trajectories are visualized in several scenarios to illustrate the validity of our M2Tames compared with MSRL, S-GAN, and STGCNN. Figure 7 shows that M2Tames successfully considers the motion characteristics of the agent itself, along with interactions among agents and between agents and their environment, and that it improves the accuracy and adaptability of trajectory prediction. Figure 8 shows the predicted trajectories in both scenarios, indicating that M2Tames makes more realistic trajectory predictions. Figure 8a shows the DeathCircle image: in a complex circular scene that contains many agents, the prediction results of M2Tames show less bias and better convergence to the ground truth. Figure 8b shows the Hyang image: both M2Tames and others are close to the ground truth, but M2Tames presents more stable trajectories with lower amplitude oscillations.

5. Conclusions

In the field of pedestrian trajectory prediction, existing research has made significant progress, particularly in modeling social interactions among agents and the influence of the environment. However, as urban environments become increasingly complex, pedestrian trajectory prediction models need to more comprehensively capture and understand the motion characteristics of agents, the intricate interactions among agents, and agents’ interactions with the environment. The improved M2Tea and I2 proposed in this study, combined with the interactive inference module and U-Net architecture, further enhance the characterization of these complex relationships. By accurately capturing pedestrians’ historical trajectory features and the potential interactions among agents, the proposed approach not only excels in prediction accuracy but also demonstrates high adaptability, effectively operating in dynamically changing environments. Compared to other models, the main contribution of this model lies in its comprehensive consideration of various factors that could influence an agent’s future trajectory. It not only analyzes the interactions among agents but also enhances the model’s ability to learn from environmental contextual information to better understand how these factors affect the agents’ future paths. Furthermore, by inferring the potential movement patterns among agents, the model significantly increases the precision and reliability of predictions.
Rigorous testing on the ETH/UCY and SDD datasets, along with comprehensive ablation studies, provides strong empirical support for the effectiveness of the model. This research not only validates theoretical advancements but also demonstrates practical improvements in real-world settings. Long-term and short-term trajectory prediction studies are both conducted. Ablation studies validate the effectiveness of the proposed model. Visualization results demonstrate its capability to predict multimodal trajectories, and the model outperforms recent state-of-the-art models on the ETH/UCY and SDD datasets.
The outcomes of this research are crucial for exploring the application of advanced machine learning techniques in autonomous driving systems, such as path planning for autonomous vehicles and robots. By predicting and interpreting complex pedestrian movements, these technologies can significantly improve the safety and efficiency of related systems.

Author Contributions

Conceptualization, X.G. and Y.W.; investigation, Y.W., Y.Z., Y.L. and G.W.; data curation, Y.W., Y.Z. and Y.L.; writing—original draft preparation, X.G. and Y.W.; writing—review and editing, X.G., Y.W., Y.Z., Y.L. and G.W.; visualization, Y.W.; supervision, X.G.; funding acquisition, G.W. All authors have read and agreed to the published version of the manuscript.

Funding

This work is supported by the Zhengzhou Major Science and Technology Project (No. 2021KJZX0060) and the Technology Special Projects in Henan Province (No. 221100210600).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data presented in this study are included in the article; further inquiries can be directed to the corresponding author.

Acknowledgments

The authors would like to express their sincere gratitude to Jiandong Shang from Zhengzhou University for his invaluable supervision and guidance throughout the course of this research. We also wish to thank Wei Cen and Chunmin Zhang from the Yutong Bus Co., Ltd., as well as Xiangdong Liu from the Geophysical Exploration Research Institute of Zhongyuan Oilfield Company, for their significant project administration, which was crucial to the completion of this paper.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Alahi, A.; Goel, K.; Ramanathan, V.; Robicquet, A.; Fei-Fei, L.; Savarese, S. Social lstm: Human trajectory prediction in crowded spaces. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 961–971. [Google Scholar]
  2. Gupta, A.; Johnson, J.; Fei-Fei, L.; Savarese, S.; Alahi, A. Social gan: Socially acceptable trajectories with generative adversarial networks. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 2255–2264. [Google Scholar]
  3. Zhang, P.; Ouyang, W.; Zhang, P.; Xue, J.; Zheng, N. Sr-lstm: State refinement for lstm towards pedestrian trajectory prediction. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; pp. 12085–12094. [Google Scholar]
  4. Kalatian, A.; Farooq, B. A context-aware pedestrian trajectory prediction framework for automated vehicles. Transp. Res. Part C Emerg. Technol. 2022, 134, 103453. [Google Scholar] [CrossRef]
  5. Yuan, Y.; Weng, X.; Ou, Y.; Kitani, K.M. Agentformer: Agent-aware transformers for socio-temporal multi-agent forecasting. In Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 10–17 October 2021; pp. 9813–9823. [Google Scholar]
  6. Wu, Y.; Wang, L.; Zhou, S.; Duan, J.; Hua, G.; Tang, W. Multi-stream representation learning for pedestrian trajectory prediction. Proc. AAAI Conf. Artif. Intell. 2023, 37, 2875–2882. [Google Scholar] [CrossRef]
  7. Yu, C.; Ma, X.; Ren, J.; Zhao, H.; Yi, S. Spatio-temporal graph transformer networks for pedestrian trajectory prediction. In Computer Vision—ECCV 2020: Proceedings of the 16th European Conference, Glasgow, UK, 23–28 August 2020; Proceedings, Part XII 16; Springer: Cham, Switzerland, 2020; pp. 507–523. [Google Scholar]
  8. Sadeghian, A.; Kosaraju, V.; Sadeghian, A.; Hirose, N.; Rezatofighi, H.; Savarese, S. Sophie: An attentive gan for predicting paths compliant to social and physical constraints. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; pp. 1349–1358. [Google Scholar]
  9. Salzmann, T.; Ivanovic, B.; Chakravarty, P.; Pavone, M. Trajectron++: Dynamically-feasible trajectory forecasting with heterogeneous data. In Computer Vision—ECCV 2020: Proceedings of the 16th European Conference, Glasgow, UK, 23–28 August 2020; Proceedings, Part XVIII 16; Springer: Cham, Switzerland, 2020; pp. 683–700. [Google Scholar]
  10. Li, J.; Ma, H.; Tomizuka, M. Conditional generative neural system for probabilistic trajectory prediction. In Proceedings of the 2019 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Macau, China, 3–8 November 2019; pp. 6150–6156. [Google Scholar]
  11. Mangalam, K.; Girase, H.; Agarwal, S.; Lee, K.H.; Adeli, E.; Malik, J.; Gaidon, A. It is not the journey but the destination: Endpoint conditioned trajectory prediction. In Computer Vision—ECCV 2020: Proceedings of the 16th European Conference, Glasgow, UK, 23–28 August 2020; Proceedings, Part II 16; Springer: Cham, Switzerland, 2020; pp. 759–776. [Google Scholar]
  12. Xu, C.; Tan, R.T.; Tan, Y.; Chen, S.; Wang, Y.G.; Wang, X.; Wang, Y. EqMotion: Equivariant Multi-agent Motion Prediction with Invariant Interaction Reasoning. In Proceedings of the 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 17–24 June 2023; pp. 1410–1420. [Google Scholar]
  13. Helbing, D.; Molnar, P. Social force model for pedestrian dynamics. Phys. Rev. E 1995, 51, 4282. [Google Scholar] [CrossRef] [PubMed]
  14. Löhner, R. On the modeling of pedestrian motion. Appl. Math. Model. 2010, 34, 366–382. [Google Scholar] [CrossRef]
  15. Amirian, J.; Hayet, J.B.; Pettré, J. Social Ways: Learning Multi-Modal Distributions of Pedestrian Trajectories With GANs. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Long Beach, CA, USA, 16–17 June 2019; pp. 2964–2972. [Google Scholar] [CrossRef]
  16. Meng, M.; Wu, Z.; Chen, T.; Cai, X.; Zhou, X.; Yang, F.; Shen, D. Forecasting human trajectory from scene history. Adv. Neural Inf. Process. Syst. 2022, 35, 24920–24933. [Google Scholar]
  17. Bera, A.; Kim, S.; Randhavane, T.; Pratapa, S.; Manocha, D. GLMP- realtime pedestrian path prediction using global and local movement patterns. In Proceedings of the 2016 IEEE International Conference on Robotics and Automation (ICRA), Stockholm, Sweden, 16–21 May 2016; pp. 5528–5535. [Google Scholar] [CrossRef]
  18. Goli, S.A.; Far, B.H.; Fapojuwo, A.O. Vehicle Trajectory Prediction with Gaussian Process Regression in Connected Vehicle Environment. In Proceedings of the 2018 IEEE Intelligent Vehicles Symposium (IV), Changshu, China, 26–30 June 2018; pp. 550–555. [Google Scholar] [CrossRef]
  19. Rudenko, A.; Palmieri, L.; Doellinger, J.; Lilienthal, A.J.; Arras, K.O. Learning Occupancy Priors of Human Motion From Semantic Maps of Urban Environments. IEEE Robot. Autom. Lett. 2021, 6, 3248–3255. [Google Scholar] [CrossRef]
  20. Rudenko, A.; Palmieri, L.; Arras, K.O. Joint Long-Term Prediction of Human Motion Using a Planning-Based Social Force Approach. In Proceedings of the 2018 IEEE International Conference on Robotics and Automation (ICRA), Brisbane, QLD, Australia, 21–25 May 2018; pp. 4571–4577. [Google Scholar] [CrossRef]
  21. Tran, H.; Le, V.; Tran, T. Goal-driven Long-Term Trajectory Prediction. In Proceedings of the 2021 IEEE Winter Conference on Applications of Computer Vision (WACV), Waikoloa, HI, USA, 3–8 January 2021; pp. 796–805. [Google Scholar] [CrossRef]
  22. Ronneberger, O.; Fischer, P.; Brox, T. U-net: Convolutional networks for biomedical image segmentation. In Medical Image Computing and Computer-Assisted Intervention–MICCAI 2015: Proceedings of the 18th International Conference, Munich, Germany, 5–9 October 2015; Proceedings, Part III 18; Springer: Cham, Switzerland, 2015; pp. 234–241. [Google Scholar]
  23. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. Adv. Neural Inf. Process. Syst. 2017, 30, 5998–6008. [Google Scholar]
  24. Young, T.; Hazarika, D.; Poria, S.; Cambria, E. Recent trends in deep learning based natural language processing. IEEE Comput. Intell. Mag. 2018, 13, 55–75. [Google Scholar] [CrossRef]
  25. Yao, H.Y.; Wan, W.G.; Li, X. End-to-end pedestrian trajectory forecasting with transformer network. ISPRS Int. J. Geo-Inf. 2022, 11, 44. [Google Scholar] [CrossRef]
  26. Lv, P.; Wang, W.; Wang, Y.; Zhang, Y.; Xu, M.; Xu, C. SSAGCN: Social soft attention graph convolution network for pedestrian trajectory prediction. IEEE Trans. Neural Netw. Learn. Syst. 2023, 35, 11989–12003. [Google Scholar] [CrossRef] [PubMed]
  27. Wang, Y.; Mohamed, A.; Le, D.; Liu, C.; Xiao, A.; Mahadeokar, J.; Huang, H.; Tjandra, A.; Zhang, X.; Zhang, F.; et al. Transformer-based acoustic modeling for hybrid speech recognition. In Proceedings of the ICASSP 2020—2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Barcelona, Spain, 4–8 May 2020; pp. 6874–6878. [Google Scholar]
  28. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv 2020, arXiv:2010.11929. [Google Scholar]
  29. Giuliari, F.; Hasan, I.; Cristani, M.; Galasso, F. Transformer Networks for Trajectory Forecasting. In Proceedings of the 2020 25th International Conference on Pattern Recognition (ICPR), Milan, Italy, 10–15 January 2021; pp. 10335–10342. [Google Scholar] [CrossRef]
  30. Mangalam, K.; An, Y.; Girase, H.; Malik, J. From goals, waypoints & paths to long term human trajectory forecasting. In Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 10–17 October 2021; pp. 15233–15242. [Google Scholar]
  31. Robicquet, A.; Sadeghian, A.; Alahi, A.; Savarese, S. Learning social etiquette: Human trajectory understanding in crowded scenes. In Computer Vision—ECCV 2016: Proceedings of the 14th European Conference, Amsterdam, The Netherlands, 11–14 October 2016; Proceedings, Part VIII 14; Springer: Cham, Switzerland, 2016; pp. 549–565. [Google Scholar]
  32. Sadeghian, A.; Kosaraju, V.; Gupta, A.; Savarese, S.; Alahi, A. Trajnet: Towards a benchmark for human trajectory prediction. arXiv 2018, arXiv:1805.07663. [Google Scholar]
  33. Bhattacharyya, A.; Hanselmann, M.; Fritz, M.; Schiele, B.; Straehle, C.N. Conditional flow variational autoencoders for structured sequence prediction. arXiv 2019, arXiv:1908.09008. [Google Scholar]
  34. Dendorfer, P.; Osep, A.; Leal-Taixé, L. Goal-gan: Multimodal trajectory prediction based on goal position estimation. In Computer Vision—ACCV 2020: Proceedings of the 15th Asian Conference on Computer Vision, Kyoto, Japan, 30 November–4 December 2020; Springer: Cham, Switzerland, 2021. [Google Scholar]
  35. Deo, N.; Trivedi, M.M. Trajectory forecasts in unknown environments conditioned on grid-based plans. arXiv 2020, arXiv:2001.00735. [Google Scholar]
  36. Liang, J.; Jiang, L.; Hauptmann, A. Simaug: Learning robust representations from 3d simulation for pedestrian trajectory prediction in unseen cameras. arXiv 2020, arXiv:2004.02022. [Google Scholar]
  37. Dendorfer, P.; Elflein, S.; Leal-Taixé, L. Mg-gan: A multi-generator model preventing out-of-distribution samples in pedestrian trajectory prediction. In Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 10–17 October 2021; pp. 13158–13167. [Google Scholar]
  38. Pang, B.; Zhao, T.; Xie, X.; Wu, Y.N. Trajectory prediction with latent belief energy-based model. In Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 20–25 June 2021; pp. 11814–11824. [Google Scholar]
  39. Li, R.; Katsigiannis, S.; Shum, H.P. Multiclass-SGCN: Sparse Graph-Based Trajectory Prediction with Agent Class Embedding. In Proceedings of the 2022 IEEE International Conference on Image Processing (ICIP), Bordeaux, France, 16–19 October 2022; pp. 2346–2350. [Google Scholar]
  40. Sighencea, B.I.; Stanciu, I.R.; Căleanu, C.D. D-STGCN: Dynamic Pedestrian Trajectory Prediction Using Spatio-Temporal Graph Convolutional Networks. Electronics 2023, 12, 611. [Google Scholar] [CrossRef]
  41. Huang, Y.; Bi, H.; Li, Z.; Mao, T.; Wang, Z. Stgat: Modeling spatial-temporal interactions for human trajectory prediction. In Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27 October–2 November 2019; pp. 6272–6281. [Google Scholar]
  42. Wong, C.; Xia, B.; Hong, Z.; Peng, Q.; Yuan, W.; Cao, Q.; Yang, Y.; You, X. View Vertically: A hierarchical network for trajectory prediction via fourier spectrums. In Computer Vision—ECCV 2022: Proceedings of the 17th European Conference, Tel Aviv, Israel, 23–27 October 2022; Springer: Cham, Switzerland, 2022; pp. 682–700. [Google Scholar]
Figure 1. Example of pedestrian motion trajectories in a real-world scenario. Pedestrian trajectory prediction should consider information such as interactions among agents and with the scene environment.
Figure 2. Pedestrian trajectory prediction model architecture.
Figure 3. The model architecture comprises two primary components: (1) the Interaction Module and (2) the Goal and Trajectory Prediction Module.
Figure 4. Mask attention. The mask matrices preserve agent identity information that standard attention would otherwise discard, so that different agents can be assigned different attention weights.
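To make the masking idea in Figure 4 concrete, the sketch below adds an additive, per-agent mask to the attention logits before the softmax, so each agent ends up with its own attention weights. This is only a minimal NumPy illustration of masked scaled dot-product attention, not the paper's M2Tea implementation; the function name, tensor shapes, and the toy causal mask are illustrative assumptions.

```python
import numpy as np

def masked_attention(q, k, v, mask):
    """Scaled dot-product attention with an additive, per-agent mask.

    q, k, v: (num_agents, seq_len, d) arrays.
    mask:    (num_agents, seq_len, seq_len) array, 0 where attention is
             allowed and a large negative number where it is blocked,
             so every agent can be given its own attention pattern.
    """
    d = q.shape[-1]
    # Attention logits for every agent: (num_agents, seq_len, seq_len).
    logits = q @ k.transpose(0, 2, 1) / np.sqrt(d)
    logits = logits + mask                       # per-agent mask applied here
    weights = np.exp(logits - logits.max(-1, keepdims=True))
    weights = weights / weights.sum(-1, keepdims=True)
    return weights @ v                           # (num_agents, seq_len, d)

# Toy usage: 2 agents, 4 time steps, 8-dim features, and a causal
# (lower-triangular) mask so each step attends only to the past.
rng = np.random.default_rng(0)
q = k = v = rng.normal(size=(2, 4, 8))
causal = np.triu(np.full((4, 4), -1e9), k=1)     # -1e9 above the diagonal
mask = np.broadcast_to(causal, (2, 4, 4))
out = masked_attention(q, k, v, mask)
print(out.shape)  # (2, 4, 8)
```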
Figure 5. The effect of the number of layers on the result.
Figure 6. The influence of the number of samples on the results.
Figure 7. Trajectory visualization. M2Tames successfully considers the motion characteristics of the agent itself, as well as interactions among agents and between agents and their environment.
Figure 8. Trajectory visualization. M2Tames successfully considers the motion characteristics of the agent itself, as well as interactions among agents and between agents and their environment.
Table 1. Short-term pedestrian trajectory prediction on the SDD dataset.

Method | Year | ADE | FDE
Social-LSTM [1] | 2016 | 57.00 | 31.20
Social-GAN [2] | 2018 | 27.23 | 41.44
CF-VAE [33] | 2019 | 12.60 | 22.30
Goal-GAN [34] | 2020 | 12.20 | 22.10
P2TIRL [35] | 2020 | 12.58 | 22.07
PECNET [11] | 2020 | 9.96 | 15.88
SimAug [36] | 2020 | 10.27 | 19.71
MG-GAN [37] | 2021 | 13.60 | 25.80
LB-EBM [38] | 2021 | 8.87 | 15.61
CAGN [38] | 2022 | 9.42 | 15.93
Multiclass-SGCN [39] | 2022 | 14.36 | 25.99
D-STGCN [40] | 2023 | 15.18 | 25.50
MSRL [6] | 2023 | 8.22 | 13.39
M2Tames | 2024 | 8.02 | 12.31
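For context, the ADE/FDE values reported in Table 1 and the tables that follow use the standard best-of-K evaluation: ADE averages the Euclidean displacement over all predicted time steps, FDE measures it at the final step, and the minimum over the K sampled trajectories is reported. The snippet below is a minimal sketch of that metric computation under these assumptions; the array names and toy data are illustrative and not taken from the paper's code.

```python
import numpy as np

def min_ade_fde(pred, gt):
    """Best-of-K ADE/FDE for a single pedestrian.

    pred: (K, T, 2) array with K predicted trajectories over T future steps.
    gt:   (T, 2) array with the ground-truth future trajectory.
    """
    # Per-step Euclidean displacement for each of the K samples: (K, T).
    dist = np.linalg.norm(pred - gt[None], axis=-1)
    ade = dist.mean(axis=1).min()   # min over samples of the mean displacement
    fde = dist[:, -1].min()         # min over samples of the final displacement
    return ade, fde

# Toy usage: K = 20 samples, 12 future steps (a common short-term setting).
rng = np.random.default_rng(0)
gt = np.cumsum(rng.normal(size=(12, 2)), axis=0)
pred = gt[None] + rng.normal(scale=0.5, size=(20, 12, 2))
print(min_ade_fde(pred, gt))
```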
Table 2. Short-term pedestrian trajectory prediction on the ETH/UCY dataset. Each cell reports ADE/FDE.

Method | ETH | HOTEL | UNIV | ZARA1 | ZARA2 | AVG
Social-LSTM [1] | 1.09/2.35 | 0.79/1.76 | 0.67/1.40 | 0.47/1.00 | 0.56/1.17 | 0.72/1.54
Social-GAN [2] | 0.81/1.52 | 0.72/1.61 | 0.60/1.26 | 0.34/0.69 | 0.42/0.84 | 0.58/1.18
ST-GAT [41] | 0.65/1.12 | 0.35/0.66 | 0.52/1.10 | 0.34/0.69 | 0.29/0.60 | 0.43/0.83
Transformer-TF [42] | 0.61/1.12 | 0.18/0.30 | 0.35/0.65 | 0.22/0.38 | 0.17/0.32 | 0.31/0.55
STAR [40] | 0.36/0.65 | 0.17/0.36 | 0.31/0.62 | 0.26/0.55 | 0.22/0.46 | 0.26/0.53
Trajectron++ [9] | 0.39/0.83 | 0.12/0.21 | 0.20/0.44 | 0.15/0.33 | 0.11/0.25 | 0.19/0.41
Goal-GAN [34] | 0.59/1.18 | 0.19/0.35 | 0.60/1.19 | 0.43/0.87 | 0.32/0.65 | 0.43/0.85
PECNet [11] | 0.54/0.87 | 0.18/0.24 | 0.35/0.60 | 0.22/0.39 | 0.17/0.30 | 0.29/0.48
V2-Net [42] | 0.23/0.37 | 0.11/0.16 | 0.21/0.35 | 0.19/0.30 | 0.14/0.24 | 0.18/0.28
MG-GAN [37] | 0.47/0.91 | 0.14/0.24 | 0.54/1.07 | 0.36/0.73 | 0.29/0.60 | 0.36/0.71
D-STGCN [40] | 0.63/1.03 | 0.37/0.58 | 0.46/0.78 | 0.35/0.56 | 0.29/0.48 | 0.42/0.68
MSRL [6] | 0.28/0.47 | 0.14/0.22 | 0.24/0.43 | 0.17/0.30 | 0.14/0.23 | 0.19/0.33
M2Tames | 0.27/0.39 | 0.11/0.16 | 0.21/0.43 | 0.17/0.26 | 0.15/0.20 | 0.18/0.28
Table 3. Long-term pedestrian trajectory prediction.

Method | Ka | ADE | FDE
Social-GAN [2] | 1 | 155.32 | 307.88
PECNET [11] | 1 | 72.22 | 118.13
Y-net [30] | 1 | 47.94 | 66.71
M2Tames | 1 | 44.85 | 65.97
M2Tames | 2 | 44.91 | 62.65
M2Tames | 5 | 39.23 | 64.33
Table 4. Ablation study.

Variant | M2Tea | I2 | U-Net | ADE | FDE
(1) | × | × | ✓ | 10.85 | 21.85
(2) | ✓ | × | ✓ | 8.75 | 13.83
M2Tames (ours) | ✓ | ✓ | ✓ | 8.02 | 12.31