Article

An Algorithm for Predicting Vehicle Behavior in High-Speed Scenes Using Visual and Dynamic Graphical Neural Network Inference

by Menghao Li 1, Miao Liu 1,*, Weiwei Zhang 2, Wenfeng Guo 3, Enqing Chen 4, Chunguang Hu 5 and Maomao Zhang 6
1 School of Mechanical and Automotive Engineering, Shanghai University of Engineering Science, Shanghai 201620, China
2 Shanghai Smart Vehicle Cooperating Innovation Center Co., Ltd., Shanghai 201805, China
3 School of Vehicle and Mobility, Tsinghua University, Beijing 100084, China
4 School of Education and Foreign Languages, Wuhan Donghu University, Wuhan 430212, China
5 School of Urban Planning and Design, Peking University, Shenzhen 518055, China
6 College of Public Administration, Huazhong University of Science and Technology, Wuhan 430079, China
* Author to whom correspondence should be addressed.
Appl. Sci. 2024, 14(19), 8873; https://doi.org/10.3390/app14198873
Submission received: 12 August 2024 / Revised: 26 September 2024 / Accepted: 28 September 2024 / Published: 2 October 2024

Abstract: Accidents caused by vehicles changing lanes occur frequently on highways. Moreover, frequent lane changes can severely impact traffic flow during peak commuting hours and on busy roads. A novel framework based on a multi-relational graph convolutional network (MR-GCN) is herein proposed to address these challenges. First, a dynamic multilevel relational graph was designed to describe interactions between vehicles and road objects at different spatio-temporal granularities, with real-time updates to edge weights to enhance understanding of complex traffic scenarios. Second, an improved spatio-temporal interaction graph generation method was introduced, focusing on spatio-temporal variations and capturing complex interaction patterns to enhance prediction accuracy and adaptability. Finally, by integrating a dynamic multi-relational graph convolutional network (DMR-GCN) with dynamic scene sensing and interaction learning mechanisms, the framework enables real-time updates of complex vehicle relationships, thereby improving the accuracy and real-time performance of behavior prediction. Experimental validation on multiple benchmark datasets, including KITTI, Apollo, and Indian, showed that the proposed algorithmic framework achieves significant performance improvements in vehicle behavior prediction tasks, with Map, Recall, and F1 scores reaching 90%, 88%, and 89%, respectively, outperforming existing algorithms. Additionally, the model achieved a Map of 91%, a Recall of 89%, and an F1 score of 90% under congested road conditions in a self-collected high-speed traffic scenario dataset, further demonstrating its robustness and adaptability in high-speed traffic conditions. These results show that the proposed model is highly practical and stable in real-world applications such as traffic control systems and self-driving vehicles, providing strong support for efficient vehicle behavior prediction.

1. Introduction

According to statistics, there are approximately 1.2 million high-speed traffic accidents yearly, resulting in numerous casualties. More than 30% of these accidents are caused by vehicle lane changing. The high incidence and severity of these accidents have prompted researchers to focus more on the study of vehicle behavior prediction. In recent years, with the rapid development of autonomous driving technology, the application of vehicle behavior prediction in high-speed scenarios has attracted considerable attention. Frequent vehicle lane changes lead to a decrease in overall traffic mobility and increase the likelihood of traffic accidents. Therefore, accurate vehicle behavior prediction can enhance the safety and efficiency of autonomous driving systems and lay a solid foundation for future intelligent transportation systems.
In autonomous driving, the acquisition of environmental information through advanced sensing technologies such as LiDAR, radar, and cameras is essential for predicting vehicle behavior [1,2]. Convolutional neural networks (CNNs) are widely employed to extract visual features for this purpose. However, reliance solely on visual data presents limitations, particularly in high-speed scenarios characterized by complex vehicle interactions [3,4], often resulting in diminished prediction accuracy in dynamic environments [5]. To enhance prediction accuracy, researchers have investigated dynamic graph neural networks (DGNNs), which effectively model spatio-temporal dependencies by capturing dynamic interactions between vehicles [6,7].
Approaches that combine graph convolutional networks (GCNs) with long short-term memory (LSTM) or temporal convolutional networks (TCNs) have shown promise in modeling these spatio-temporal dependencies [8,9]. Despite these advancements, existing methods often struggle to accurately represent non-Euclidean spatial relationships between vehicles and to account for critical factors such as lane positioning and traffic flow in high-speed contexts [10,11]. To address these challenges, researchers have proposed a method that integrates visual data with dynamic graph neural networks, employing CNNs to extract vehicle features and DMR-GCN to more effectively capture spatio-temporal dependencies [12,13]. This approach constructs a traffic scene graph to represent vehicle layout and motion, thereby improving predictions in complex, dynamic environments [14,15]. By effectively modeling non-Euclidean spatial relationships, the proposed method overcomes the limitations of existing techniques and enhances the accuracy of vehicle behavior predictions in high-speed scenarios [16,17].
The integration of advanced graph neural networks with visual features opens new possibilities for precise and reliable behavior prediction, even under challenging conditions [18,19]. Studies indicate that incorporating factors such as vehicle distance, lane position, and dynamic traffic flow significantly enhances prediction models [20,21]. Future research may further explore hybrid methods that combine visual data with advanced neural network architectures to improve the robustness of vehicle behavior prediction [22].
The aim of the current study was to enhance the accuracy of vehicle behavior prediction in high-speed scenarios, addressing traffic accidents caused by lane changes. By integrating visual perception with a dynamic graph neural network (DMR-GCN) to construct a multilevel graph model that captures vehicle interactions, an adaptive network tuning mechanism and a dynamic scene learning strategy are herein introduced to improve prediction accuracy and safety. This approach addresses the shortcomings of existing methods in modeling complex interactions. It critically supports advancing autonomous driving and intelligent transportation systems, demonstrating significant practical value and broad application potential.
The contributions made include the following:
(1) Visual and graph neural network combination: integrating visual perception with a dynamic graph neural network (DMR-GCN) to enhance behavior prediction in high-speed scenes;
(2) Multilayer graph modeling: creating a graph model to capture vehicle interactions at different spatio-temporal levels. Nodes represent vehicles or road objects, while edges reflect real-time interactions, with DMR-GCN dynamically adjusting edge weights;
(3) Adaptive network tuning: introducing an adaptive mechanism in DMR-GCN for real-time interaction capture and feature processing to improve prediction accuracy;
(4) Dynamic scene learning: implementing real-time learning strategies based on vehicle position, speed, and road conditions to improve accuracy and safety.

2. Related Work

2.1. Vehicle and Traffic Element Identification

In autonomous driving and advanced driver assistance systems (ADAS), the recognition of vehicles and traffic elements is a pivotal task. Recently, numerous research endeavors have focused on crafting efficient object detection and recognition algorithms, prominently including YOLO (You Only Look Once) [23], SSD (Single-Shot MultiBox Detector) [24], and Faster R-CNN (Faster Region-based Convolutional Neural Network) [25]. YOLO, a deep learning-based object detection approach, operates by dividing the input image into an S × S grid, where each grid cell predicts B bounding boxes with their confidence scores and C conditional class probabilities. Detection is then performed through end-to-end regression optimization.
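As a rough illustration of this grid-based prediction scheme, the short Python sketch below decodes a YOLO-style output tensor of shape (S, S, B*5 + C); the tensor layout, the confidence threshold, and the random input are illustrative assumptions rather than the exact format of any specific YOLO release.

```python
import numpy as np

def decode_yolo_grid(pred, S=7, B=2, C=20, conf_thresh=0.5):
    """Decode a YOLO-style grid prediction of shape (S, S, B*5 + C).

    Each cell predicts B boxes (x, y, w, h, confidence) plus C class
    probabilities shared by the cell; this layout is assumed for illustration.
    """
    detections = []
    for row in range(S):
        for col in range(S):
            cell = pred[row, col]
            class_probs = cell[B * 5:]                  # C conditional class probabilities
            for b in range(B):
                x, y, w, h, conf = cell[b * 5:(b + 1) * 5]
                scores = conf * class_probs             # class-specific confidence
                cls = int(np.argmax(scores))
                if scores[cls] >= conf_thresh:
                    # (x, y) are offsets within the cell; convert to image-relative coordinates
                    cx, cy = (col + x) / S, (row + y) / S
                    detections.append((cx, cy, w, h, cls, float(scores[cls])))
    return detections

# Example with a random tensor standing in for a network output
boxes = decode_yolo_grid(np.random.rand(7, 7, 2 * 5 + 20))
```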
YOLO distinguishes itself in terms of speed, accuracy, and computational efficiency. It achieves a remarkable detection speed of 45–60 frames per second (FPS) on standard GPUs, coupled with a mean average precision (Map) of 57.9%, surpassing Faster R-CNN’s 42.7% Map. Furthermore, compared to SSD, YOLO offers faster processing rates (30–40 FPS for SSD) and a higher Map (46.5% for SSD). Notably, YOLO boasts a compact model size and minimal computational resource demands, rendering it an ideal candidate for deployment on embedded systems and mobile devices.
Given our reliance solely on monocular camera data and the imperative to drastically curtail computational overhead while maintaining high recognition accuracy, we opted for YOLO as the core object detection and recognition algorithm for our real-time, efficient vehicle and traffic element recognition framework in the context of autonomous driving and ADAS. Inspired by recent advances in vehicle tracking methods [26], we chose FairMOT-MCVT as our core algorithm for real-time efficient vehicle tracking for applications in autonomous driving and ADAS.

2.2. Vehicle Behavior Understanding

Understanding vehicle behavior and road scenarios is a crucial challenge in autonomous driving. Traditional approaches typically rely on data from multiple sensors to achieve this goal. Many studies have employed rule-based approaches and probabilistic modeling techniques to classify driving behavior. For example, Sivaraman et al. [27] used probabilistic modeling techniques to predict vehicle behavior. This approach effectively handles uncertainty but is hampered by model complexity and high computational resource demands. Numerous early studies primarily focused on predicting future vehicle trajectories rather than behavioral classification. For instance, Kitani et al. [28] utilized sensor data to predict vehicle trajectories, providing short-term driving path predictions but lacking a comprehensive understanding of driving behavior. Similarly, Lee et al. [29] proposed an approach based on trajectory history data, which, although adequate for short-term trajectory prediction, performs poorly in long-term behavior prediction.
This study takes a concise yet ambitious approach to predicting vehicle behavior by capturing and interpreting scenes from a single camera. While previous research has utilized single-camera data, most efforts have focused on predicting the behavior of the ego vehicle. In contrast, this research categorizes the behavior of surrounding vehicles from the perspective of the ego vehicle. Neumann and Vedaldi [30] demonstrated the potential of using monocular camera data for pedestrian and ego-vehicle trajectory prediction, illustrating how single-camera systems can offer valuable insights into dynamic road scenarios. Building on this, our approach enhances autonomous driving systems’ behavioral planning and state estimation capabilities and has potential applications in traffic violation detection within video footage. By meticulously analyzing the behavioral patterns of neighboring vehicles, this technique improves traffic safety, optimizes decision-making in autonomous driving systems, and ultimately enhances both the driving experience and the overall reliability of the transportation system.

2.3. Graph Neural Networks

In recent years, graph data modeling has successfully integrated deep learning methodologies, leading to the development of numerous algorithms for unstructured data. Meyer et al. [31] introduced a deep learning framework using heterogeneous graph representations to model complex traffic scenarios in autonomous driving. Lee et al. [32] devised a multi-agent trajectory prediction model based on graph neural networks, enhancing prediction performance for multi-modal behaviors. Li et al. [33] presented the STS-DGNN, a dynamic graph neural network that extracts spatial and temporal features for vehicle trajectory prediction, significantly improving accuracy and real-time performance. Zhang et al. [34] integrated the Graph Attention Transformer (Gatformer) into trajectory prediction, effectively capturing the spatio-temporal interplay between traffic agents and road infrastructure. Despite these advancements, these methods still encounter computational efficiency and scalability challenges, mainly when dealing with real-time dynamic data and large-scale, intricate scenes.
A framework based on multi-relational graph convolutional networks (MR-GCN) was developed to capture and update real-time spatio-temporal relationships between vehicles. As shown in Figure 1, the framework’s incorporation of MR-GCN capabilities and interactive learning mechanisms enhances real-time performance and prediction accuracy, addressing the complexities of modeling and predicting vehicle behavior in dynamic environments.

2.4. Current Challenges and Future Directions of Work

Despite significant advancements, predicting vehicle behavior and recognizing traffic elements in dynamic and complex environments remain challenging. These challenges include meeting real-time requirements, adapting to diverse traffic scenarios, and managing data noise and incompleteness. Additionally, reducing computational costs and energy consumption while maintaining model performance is a pressing concern.
To further improve model performance and practicality, future research could explore the following areas:
(1) Developing more robust graph neural network architectures that adapt to varying traffic complexities;
(2) Integrating diverse data streams from multiple sources, such as radar, LiDAR, and telematics, to enhance model comprehensiveness and accuracy;
(3) Investigating online learning and self-adaptive mechanisms to enable dynamic adjustment to shifting traffic environments;
(4) Optimizing computational resource usage to meet real-time application requirements.
Exploring these areas is expected to create a more intelligent and reliable system for predicting vehicle behavior and recognizing traffic elements.

3. Materials and Methods

The proposed methodological framework builds on and improves the data modeling methodology introduced in [35] (MR-GCN-LSTM), as shown in Figure 2. This multi-stage behavioral prediction process involves the following: first, modeling the video as a series of spatial scene graphs; second, generating a multi-relational interaction graph from this spatio-temporal information to represent the evolution of spatial relationships between entities over time; finally, designing a dynamic multi-relational graph convolutional network (DMR-GCN) to predict vehicle behavior using a graph-based learning model.
The proposed methodological framework’s performance enhancement is due to three key innovations:
(1) The dynamic multilevel relational graph (DMRG) design accurately captures the complex dynamics of vehicle lane changes by representing different spatio-temporal granularities through multiple layers and updating edge weights in real time;
(2) The method for generating temporal interaction graphs focuses on capturing temporal changes and complex interaction patterns such as vehicle acceleration, sharp braking, and behaviors under challenging traffic conditions (e.g., rainy days and nights). This comprehensive approach enhances the model's understanding of dynamic vehicle interactions, improving prediction accuracy and adaptability to diverse traffic scenarios;
(3) The DMR-GCN structure includes a dynamic multi-relational graph convolutional network (DMR-GCN), dynamic scene perception, and an interactive learning mechanism.
DMR-GCN employs an adaptive adjacency matrix updating mechanism and a multi-relational feature fusion mechanism to achieve dynamic relational capture and comprehensive feature processing, enhancing adaptability to real-time data. The interactive learning mechanism allows the model to adjust to changing traffic environments, improving vehicle behavior prediction accuracy and real-time performance.

3.1. Traffic Scenario Map Construction

Traffic scene graph construction involves multiple steps. First, objects are identified and tracked across consecutive video frames. Next, these objects are oriented in a bird’s-eye view to facilitate the identification of their relative positions. Finally, for each frame, the relative position information of the objects is encoded with a spatial map, as shown in Figure 3.

3.1.1. Object Tracking

To fully understand the dynamic scene, three main feature extraction pipelines are used as inputs to recognize different objects and determine the spatial relationships of vehicles in the video. The key components for dynamic feature extraction include instance segmentation, semantic segmentation, pixel-by-pixel optical flow, and depth data in video frames. First, Mask R-CNN [36] is used for instance segmentation to detect and track moving objects over time. Second, Detectron2 [37] is employed to achieve a semantic understanding of the scene, particularly for static objects such as traffic signs and lane markings, which serve as important landmarks. Finally, RAFT (Recurrent All-Pairs Field Transforms) [38] is used to compute the optical flow between consecutive video frames, enabling the tracking of each object in the scene by combining optical flow information with instance features and semantic segmentation.
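A minimal sketch of how these streams could be wired together with off-the-shelf torchvision models (recent versions with the weights API) is shown below; the Detectron2 semantic branch is omitted, and the score threshold and flow-based centroid propagation are illustrative assumptions rather than the exact pipeline used in this work.

```python
import torch
from torchvision.models.detection import maskrcnn_resnet50_fpn
from torchvision.models.optical_flow import raft_large

# Instance segmentation for moving objects and RAFT optical flow for frame-to-frame motion
detector = maskrcnn_resnet50_fpn(weights="DEFAULT").eval()
flow_net = raft_large(weights="DEFAULT").eval()

@torch.no_grad()
def detect(frame):
    """frame: float tensor (3, H, W) in [0, 1]; returns kept boxes, masks, and scores."""
    out = detector([frame])[0]
    keep = out["scores"] > 0.7                       # assumed confidence threshold
    return out["boxes"][keep], out["masks"][keep], out["scores"][keep]

@torch.no_grad()
def propagate_centroids(prev_frame, next_frame, centroids):
    """Shift previous-frame centroids by the estimated flow to predict their new positions.

    Frames are (3, H, W) tensors in [0, 1] with H and W divisible by 8, as RAFT expects.
    """
    f1 = prev_frame.unsqueeze(0) * 2 - 1             # RAFT expects inputs scaled to [-1, 1]
    f2 = next_frame.unsqueeze(0) * 2 - 1
    flow = flow_net(f1, f2)[-1][0]                   # (2, H, W), final refinement step
    moved = []
    for x, y in centroids:
        dx, dy = flow[:, int(y), int(x)]
        moved.append((x + dx.item(), y + dy.item()))
    return moved                                     # associate with new detections, e.g., by nearest neighbor
```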

3.1.2. Monocular to Bird’s-Eye View

After obtaining the stabilized trajectories in the image space, these trajectories are converted to a bird's-eye view to better capture the relative positions of different objects. The geometric relation formulation described in GitNet [39] is used for object localization to achieve this conversion. Each object is assigned a reference point, namely the centroid of the lane marker for lanes and the point adjacent to the road for vehicles and other landmarks, as shown in Figure 2d. Let b = (x, y, 1) represent the reference point in homogeneous coordinates in the image. These homogeneous coordinates b are then converted to coordinates in the bird's-eye view as follows:
B = (B_x, B_y, B_z) = \frac{h K^{-1} b}{\eta^{T} K^{-1} b}
where K is the camera intrinsic matrix, η is the ground-surface normal, h is the height of the camera above the ground plane, and B gives the 3D coordinates of point b in the bird's-eye view.
The specific process is as follows: a perspective transformation is performed using the intrinsic matrix K and the extrinsic parameters of the camera. Assume a point in the camera's coordinate system is (X, Y, Z). The ground plane is represented as (X, 0, Z) by setting the y-coordinate in the camera's coordinate system to 0. Removing the y-dimension simplifies the camera coordinates to (X, Z), and the bird's-eye-view coordinates are correspondingly represented as (X, Z). The one-to-one correspondence between image coordinates and ground coordinates can be expressed by the following equation:
(u, v, 1)^{T} = K\,(X, Y, Z)^{T}
The inverse transformation from image coordinates to ground coordinates is given by the following:
(X, Y, Z)^{T} = K^{-1}\,(u, v, 1)^{T}
After obtaining the trajectories of these objects in image space, they are relocated in the bird’s-eye view (top view) by projecting the image coordinates to 3D coordinates. This repositioning helps determine the spatial relationships between different entities. Each object is assigned a reference point to account for height differences. For lane markers, the reference point is at the center, while for vehicles, it is a point close to the road. Using the geometric relations described above, the transformation from perspective space to a bird’s-eye view can be achieved, allowing for the recovery of approximate ground coordinates from the image coordinates and the camera height [40]. However, in a real monocular perception system, the camera height is not directly available. This issue is addressed by learning the camera height parameters through the network. By combining optical flow information with instance features and semantic segmentation, each object in the scene can be accurately tracked in a bird’s-eye view, providing a more comprehensive understanding of the environment.
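The projection described above can be transcribed numerically as follows; the intrinsic matrix, ground normal, and camera height used here are placeholder values for illustration, not calibration parameters from this study.

```python
import numpy as np

def image_to_bev(b_px, K, eta, h):
    """Project a homogeneous image point b = (u, v, 1) to ground coordinates
    via B = h * K^{-1} b / (eta^T K^{-1} b)."""
    ray = np.linalg.inv(K) @ b_px        # back-projected viewing ray
    return h * ray / (eta @ ray)         # scale the ray until it meets the ground plane

# Placeholder calibration: intrinsics K, ground normal eta, camera height h (meters)
K = np.array([[721.5, 0.0, 609.6],
              [0.0, 721.5, 172.9],
              [0.0, 0.0, 1.0]])
eta = np.array([0.0, 1.0, 0.0])          # ground-plane normal in camera coordinates
h = 1.65

B = image_to_bev(np.array([640.0, 200.0, 1.0]), K, eta, h)
bev_position = (B[0], B[2])              # keep (X, Z) after dropping the height axis
```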

3.1.3. Spatial Scene Maps

At each frame t of the video, spatial information is captured based on the relative positions of different objects in the bird's-eye view and represented as a scene graph. At time t, the spatial relationship S_t(i, j) between subject i and object j is defined by the quadrant in which the object lies relative to the subject, i.e., S_t(i, j) ∈ R_s, where R_s = {top left, top right, bottom left, bottom right}. These spatial relationships are determined by the 3D position information of the objects in the bird's-eye view. An example of a spatial scene graph is shown in Figure 2e.
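A small helper illustrating how the quadrant relation S_t(i, j) could be derived from two bird's-eye-view positions; the coordinate convention (x to the right, z ahead) is an assumption for illustration.

```python
def spatial_relation(subject_xz, object_xz):
    """Return the quadrant of the object relative to the subject in the BEV plane.

    Assumed convention: +x points to the subject's right and +z points ahead of it.
    """
    dx = object_xz[0] - subject_xz[0]
    dz = object_xz[1] - subject_xz[1]
    vertical = "top" if dz >= 0 else "bottom"
    horizontal = "right" if dx >= 0 else "left"
    return f"{vertical} {horizontal}"

# A vehicle 3 m ahead and 2 m to the left of the ego vehicle
print(spatial_relation((0.0, 0.0), (-2.0, 3.0)))   # -> "top left"
```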

3.2. Vehicle Behavior Spatio-Temporal Interaction

3.2.1. Dynamic Multi-Level Relationship Diagram Modeling

As shown in Figure 4, after generating the spatial scene graph, the dynamic multi-level relationship graph modeling method is introduced. This method is based on a multi-level structure that represents the relationships between vehicles at different spatial and temporal granularities, thereby capturing and understanding the dynamic interactions of vehicles in a more comprehensive manner. Compared with static spatial scene graphs, the dynamic multi-level relationship graph can update the interactions between vehicles in real time, improving the model’s prediction accuracy and robustness for vehicle lane-changing behavior. By representing relationships at different spatio-temporal granularities, the model better adapts to complex traffic situations, including rapidly changing vehicle behaviors on highways.
The interaction between vehicles and road objects such as traffic signs and obstacles can be modeled in complex traffic scenarios using the following equation:
G = (V, E)
where V represents the set of nodes, including vehicles and road objects, and E denotes the edges, reflecting the relationships between these nodes.
In the case of multi-level relationships, each level is characterized by a specific type of interaction. For instance, the first hierarchical level represents direct proximity, the second reflects extended proximity, and the third captures more distant or complex relationships. The overall relationship diagram of the transportation scenario can be represented as a combination of the relationship diagrams at each level:
G = \bigcup_{l=1}^{L} G_l = \bigcup_{l=1}^{L} (V, E_l)
where L represents the total number of relationship layers.
For dynamic edge weights, each edge e_{ij}^{l}, connecting nodes v_i and v_j on layer l, is assigned a time-dependent weight w_{ij}^{l}(t). This weight is dynamically computed using the following equation:
w_{ij}^{l}(t) = \frac{1}{1 + \exp\left(-\left(\alpha d_{ij}(t) + \beta \Delta v_{ij}(t) + \gamma \Delta a_{ij}(t)\right)\right)}
where d_{ij}(t) denotes the distance between vehicles v_i and v_j at time t; \Delta v_{ij}(t) represents the relative speed between the two vehicles; \Delta a_{ij}(t) indicates their relative acceleration; and \alpha, \beta, and \gamma are coefficients that regulate the influence of these factors.
Ultimately, the complete mathematical expression for modeling dynamic multi-level relationship graphs is obtained as follows:
G(t) = (V, E(t)) = \left(V, \bigcup_{l=1}^{L} \theta_l E_l(t)\right) = \left(V, \bigcup_{l=1}^{L} \theta_l \left\{\left(v_i, v_j, \frac{1}{1 + \exp\left(-\left(\alpha d_{ij}(t) + \beta \Delta v_{ij}(t) + \gamma \Delta a_{ij}(t)\right)\right)}\right)\right\}\right)
where \theta_l denotes the weight assigned to relationship layer l.
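The edge weight above is simply a sigmoid of a weighted sum of relative distance, speed, and acceleration; a direct transcription with arbitrary example coefficients might look like the following sketch.

```python
import math

def edge_weight(d_ij, dv_ij, da_ij, alpha=0.5, beta=0.3, gamma=0.2):
    """w_ij(t) = 1 / (1 + exp(alpha*d + beta*dv + gamma*da)), i.e., a sigmoid of the
    negated weighted sum; the coefficient values are illustrative only."""
    return 1.0 / (1.0 + math.exp(alpha * d_ij + beta * dv_ij + gamma * da_ij))

# A nearby, slowly separating pair receives a noticeably larger weight...
print(edge_weight(d_ij=2.0, dv_ij=0.5, da_ij=0.1))
# ...than a distant, fast-separating pair.
print(edge_weight(d_ij=30.0, dv_ij=8.0, da_ij=1.0))
```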

3.2.2. Enhanced Temporal Interaction Diagram Generation

In modern traffic prediction and vehicle interaction analysis, traditional temporal interaction map generation methods often consider only simple temporal evolution information. They cannot adequately capture complex vehicle dynamics and variable traffic scenarios. An enhanced temporal interaction graph generation method was designed to address this issue. This method focuses on temporal changes and captures complex interactions such as vehicle acceleration, hard braking, hard acceleration, and behavioral patterns under complex traffic conditions (e.g., rainy days and nighttime). The generated temporal interaction diagram includes information such as “Moving Forward”, “Moving Backward”, “Moving Left to Right”, “Moving Right to Left”, “No Change”, and more complex interactions. This comprehensive approach enables the model to capture and understand dynamic vehicle interactions better, thereby improving prediction accuracy and adaptability to diverse traffic scenarios.
To represent the time series graph, the traffic scenario at each time step t is modeled using the following equation:
G_t = (V_t, E_t)
where V_t represents the set of nodes at time t, including all vehicles and road objects involved in the interaction at that moment, and E_t denotes the set of edges at time t, representing the interactions between these nodes.
To capture the dynamic interaction of vehicles over time, we consider the dynamic changes in the edge weights w_{ij}(t). The relationship between vehicles v_i and v_j is assumed to depend not only on their current state but also on their behavior in previous frames. This relationship can be modeled using the following equation:
w_{ij}(t) = \sum_{\tau=0}^{\Delta t} \lambda^{\tau} f\left(v_i(t-\tau), v_j(t-\tau)\right)
where \Delta t represents the time window indicating the number of past time steps considered, \lambda is a decay factor that controls the influence of past time steps on the current edge weights, and f(v_i(t-\tau), v_j(t-\tau)) is a function that calculates the interaction between vehicles v_i and v_j at time t-\tau, which may be based on differences in velocity, relative positions, accelerations, and other factors.
Furthermore, we introduce the concept of time series convolution to capture complex interaction patterns over time. This technique aggregates vehicle interaction information over time, generating a new edge weight. The convolution operation can be expressed as follows:
\tilde{w}_{ij}(t) = \sigma\left(\sum_{k=0}^{K-1} W_k\, w_{ij}(t-k) + b_k\right)
where \tilde{w}_{ij}(t) represents the edge weight after temporal convolution; K is the size of the convolution kernel, indicating the number of time steps considered; W_k is the convolution kernel weight for the k-th time step; b_k is the bias term; and \sigma is the nonlinear activation function.
At each time step t, we obtain a temporally convolved graph \tilde{G}_t = (V_t, \tilde{E}_t), where \tilde{E}_t represents the set of edges obtained after the temporal convolution. To capture the dynamic interactions over the entire period, we aggregate the convolved graphs from all time steps to obtain the final temporal interaction graph G_{temp}:
G_{temp} = \bigcup_{t=1}^{T} \tilde{G}_t
where T denotes the total number of time steps. This temporal interaction graph synthesizes the complex behavioral patterns of vehicles over time, effectively reflecting dynamic changes throughout the entire time series. It is well suited for accurate behavioral prediction and analysis. Time series convolution enhances the model's ability to capture complex interactions in the temporal dimension, thereby improving its performance in time-dynamic scenarios.
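A compact sketch of the two temporal operations defined above, namely the decayed history sum for w_{ij}(t) and the subsequent convolution over past steps; the interaction function f, the decay factor, the kernel, and the tanh activation are illustrative assumptions.

```python
import numpy as np

def history_weight(states_i, states_j, lam=0.8, window=5):
    """w_ij(t) = sum_{tau=0..window} lam**tau * f(v_i(t-tau), v_j(t-tau)).

    states_*: per-frame state dicts with "pos" and "speed", most recent last.
    The interaction function f below (inverse distance plus relative speed) is a placeholder.
    """
    def f(si, sj):
        dist = np.linalg.norm(np.asarray(si["pos"]) - np.asarray(sj["pos"]))
        dv = abs(si["speed"] - sj["speed"])
        return 1.0 / (1.0 + dist + 0.5 * dv)

    steps = min(window + 1, len(states_i), len(states_j))
    return sum((lam ** tau) * f(states_i[-1 - tau], states_j[-1 - tau]) for tau in range(steps))

def temporal_conv(weight_history, kernel, bias=0.0):
    """~w_ij(t) = sigma(sum_k W_k * w_ij(t-k) + b), with sigma taken as tanh here.

    weight_history is ordered oldest to newest; kernel[k] multiplies w_ij(t-k).
    """
    recent = np.asarray(weight_history)[::-1][: len(kernel)]
    return float(np.tanh(np.dot(kernel[: len(recent)], recent) + bias))

# Smooth the last four edge weights of a vehicle pair with an example kernel
print(temporal_conv([0.42, 0.45, 0.50, 0.55], kernel=np.array([0.4, 0.3, 0.2, 0.1])))
```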

3.3. Vehicle Behavior Prediction

The traditional MR-GCN framework fails to utilize dynamic knowledge to capture dynamic spatio-temporal correlations fully. A DMR-GCN structure was designed to address this issue, incorporating a multi-relational graph convolutional network (MR-GCN) and a dynamic scene perception and interactive learning mechanism. This structure models the spatio-temporal relationships between the vehicle, other vehicles, and landmarks in the scene over time to obtain a comprehensive spatio-temporal representation. This representation is achieved by stacking different neural network layers, effectively enhancing the capture of dynamic spatio-temporal correlations and improving the model’s expressiveness and accuracy, as shown in Figure 5.

3.3.1. DMR-GCN

Multi-relational graph convolutional networks (MR-GCNs) are a typical means of modeling interactions in knowledge graphs and spatio-temporal data. However, traditional MR-GCN frameworks use a fixed weight matrix for different relationship types, which limits adaptability to dynamic environments. A dynamic multi-relational graph convolutional network (DMR-GCN) was designed to address this. The core components of this network include an adaptive neighbor matrix updating mechanism and a multi-relationship feature fusion mechanism. These mechanisms allow for more flexible dynamic relationship capture and comprehensive multi-relationship feature processing, enhancing the network’s adaptability to real-time data. The specific steps are as follows:
(1) Adaptive Neighborhood Matrix Update Mechanism:
  (1) Node Feature Conversion
First, the node features h_G^{k-1} of the (k-1)-th layer are converted into a query vector Q and a key vector K using the query weight matrix W_q and the key weight matrix W_k, respectively:
Q = h_G^{k-1} W_q, \quad K = h_G^{k-1} W_k
where W_q, W_k \in \mathbb{R}^{d \times d}.
  (2) Similarity Calculation
Next, the similarity between nodes is calculated using the dot product, which is then scaled to prevent excessively large values:
\mathrm{Similarity}_{ij} = \frac{Q_i K_j^{T}}{\sqrt{d_k}}
where d_k is the dimension of the key vector.
  (3) Weight Normalization
Attention weights are obtained by normalizing the similarity using the following function:
\alpha_{ij} = \mathrm{softmax}\left(\frac{Q_i K_j^{T}}{\sqrt{d_k}}\right) = \frac{\exp\left(Q_i K_j^{T}/\sqrt{d_k}\right)}{\sum_{j'=1}^{n} \exp\left(Q_i K_{j'}^{T}/\sqrt{d_k}\right)}
  (4) Dynamic Neighborhood Matrix Update
Using the attention weights \alpha_{ij}, we then update the adjacency matrix of relationship r:
\hat{A}_r[i, j] = \alpha_{ij}
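A minimal PyTorch sketch of this attention-based adjacency update: node features are projected to queries and keys, scaled dot-product scores are normalized row-wise with softmax, and the result serves as the dynamic adjacency matrix \hat{A}_r for one relation; the dimensions and single-relation scope are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AdaptiveAdjacency(nn.Module):
    """Builds a dynamic adjacency matrix A_hat_r from node features via scaled dot-product attention."""

    def __init__(self, dim):
        super().__init__()
        self.W_q = nn.Linear(dim, dim, bias=False)   # query projection W_q
        self.W_k = nn.Linear(dim, dim, bias=False)   # key projection W_k

    def forward(self, h):                            # h: (num_nodes, dim) features from layer k-1
        Q, K = self.W_q(h), self.W_k(h)
        scores = Q @ K.t() / (h.size(-1) ** 0.5)     # scaled dot-product similarity
        return F.softmax(scores, dim=-1)             # alpha_ij becomes A_hat_r[i, j]

# Example: six vehicles with 16-dimensional features give a 6 x 6 row-stochastic matrix
A_hat = AdaptiveAdjacency(16)(torch.randn(6, 16))
```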
(2) Multi-Relational Feature Fusion Mechanisms
A multi-relational feature fusion mechanism was designed to synthesize features from different relationships. This mechanism enhances the model’s ability to capture richer interaction information by introducing adjacency matrices for multiple relationship types. Specifically, adjacency matrices are separately constructed for each relationship type, and these matrices are used to perform weighted fusion among the different relationship features, generating a comprehensive feature representation. This approach more accurately reflects the multiple interactions between nodes and improves the model’s expressiveness and predictive performance in complex scenarios.
  (1) Multi-Relational Graph Convolution
A multi-relational graph convolution operation is performed on the updated adjacency matrix \hat{A}_r. For relationship r, the node feature representation h_G^{k} at the k-th layer is computed using the following equation:
h_G^{k} = \mathrm{ReLU}\left(\sum_{r \in R} \hat{A}_r\, h_G^{k-1} W_r^{k} + h_G^{k-1} W_s^{k}\right)
where \hat{A}_r is the adjacency matrix of relationship r, dynamically updated through the attention mechanism; W_r^{k} is the weight matrix of the k-th layer associated with relationship r; W_s^{k} is the weight matrix of the k-th layer for the node's own information (self-loop); and \mathrm{ReLU} is the activation function.
(3) Feature Fusion
The results of the convolution for all relationships are weighted and fused to combine the effects of the different relationships:
h_G^{k} = \mathrm{ReLU}\left(\sum_{r \in R} \beta_r \hat{A}_r\, h_G^{k-1} W_r^{k} + h_G^{k-1} W_s^{k}\right)
where \beta_r is the fusion weight of relationship r, which is learned dynamically during training.
(4) Representation of Interlayer Stacking and Multi-Hop Relationships
The multi-hop relational representation of a node can be obtained by stacking multiple MR-GCN layers. If K layers are stacked, the K-hop relational representation is obtained through the layer-wise recursion:
h_G^{k+1} = \mathrm{ReLU}\left(\sum_{r \in R} \beta_r \hat{A}_r\, h_G^{k} W_r^{k+1} + h_G^{k} W_s^{k+1}\right)
Overall, DMR-GCN achieves more flexible dynamic relationship capture and comprehensive multi-relational feature processing by introducing an adaptive adjacency matrix updating mechanism and a multi-relational feature fusion mechanism. These enhancements significantly improve the model’s adaptability and predictive ability in complex scenes.
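Putting the pieces together, a single DMR-GCN layer can be sketched as per-relation propagation over its dynamic adjacency matrix, a learned fusion weight \beta_r per relation, and a self-loop transform, all passed through ReLU; this is a simplified illustration under assumed dimensions, not the authors' exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DMRGCNLayer(nn.Module):
    """h^k = ReLU( sum_r beta_r * A_hat_r @ h^{k-1} @ W_r^k + h^{k-1} @ W_s^k )."""

    def __init__(self, in_dim, out_dim, num_relations):
        super().__init__()
        self.W_r = nn.ModuleList([nn.Linear(in_dim, out_dim, bias=False)
                                  for _ in range(num_relations)])
        self.W_s = nn.Linear(in_dim, out_dim, bias=False)    # self-loop transform W_s
        self.beta = nn.Parameter(torch.ones(num_relations))  # learned fusion weights beta_r

    def forward(self, h, adj_per_relation):                  # adj_per_relation: list of (N, N) tensors
        out = self.W_s(h)
        for r, A_hat in enumerate(adj_per_relation):
            out = out + self.beta[r] * (A_hat @ self.W_r[r](h))
        return F.relu(out)

# Example: 6 nodes, 16-d features, 4 relation types; stacking two layers gives 2-hop context
h = torch.randn(6, 16)
adjs = [torch.softmax(torch.randn(6, 6), dim=-1) for _ in range(4)]
layer1, layer2 = DMRGCNLayer(16, 32, 4), DMRGCNLayer(32, 32, 4)
h2 = layer2(layer1(h, adjs), adjs)
```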

3.3.2. Dynamic Scene Perception and Interactive Learning Mechanism

After encoding the spatial relationships using DMR-GCN, a dynamic scene sensing and interactive learning mechanism is introduced. This mechanism dynamically adjusts the model’s learning process by sensing real-time changes in the traffic scene and interaction patterns of vehicles. Through interactive learning, the model can adjust parameters and weights in real time during the prediction process, thereby improving predictions’ real-time performance and accuracy based on the current traffic situation and vehicle interaction patterns.

4. Experimentation and Analysis

4.1. Datasets

To comprehensively evaluate the performance of our framework, three representative public datasets (ApolloScapes [41], KITTI [42], and the Indian dataset) were selected, together with a self-collected dataset. Following the criteria outlined in [35], the ApolloScapes, KITTI, and Indian datasets were partitioned into training, testing, and validation sets. This partitioning method was extended to the self-collected dataset, with specialized data annotation tailored to the task requirements.
In terms of datasets, the ApolloScapes dataset serves as the core. Its rich driving scenarios, especially overtaking and lane-changing behaviors, are fully demonstrated by the selected 4000 frames of images, providing a solid foundation for model training. We focused on highway and comprehensive road environments for the KITTI dataset, selecting 700 frames from tracking sequences 4, 5, and 10 to evaluate the model’s generalization ability under complex road conditions. The Indian dataset, with its unique vehicle types (e.g., tricycles and trucks) and congested driving environments, presents a severe test of the model’s transfer learning capability; 600 frames of images provide a small but challenging sample. The 65 km section of the S20 Outer Ring Highway and G2 Beijing-Shanghai Highway from Pudong to Jiading in Shanghai, characterized by high traffic flow and various vehicle types, frequently experiences traffic congestion due to lane-changing collisions, making it an ideal route for data acquisition, as shown in Figure 6. Video captured by the front monocular camera of the data collection vehicle on this road section constituted the experimental dataset.
Prior to training the network, extensive data preprocessing was performed, including normalizing the image pixel values to the [0, 1] range, resizing the images to a consistent resolution, and applying data augmentation methods (e.g., rotation, flipping, and luminance adjustments) to improve the robustness of the model under varying lighting and environmental conditions. The network training process included dividing the dataset into a training set (70%), validation set (15%), and test set (15%), and training was performed using a cross-entropy loss function and an Adam optimizer with a learning rate of 0.001. The training lasted 300 epochs, and an early stopping strategy was implemented to prevent overfitting. Key hyperparameters include a batch size of 32, a learning rate of 0.001, a dropout rate of 0.5 (to mitigate overfitting), and an architecture consisting of five convolutional layers and three fully connected layers.
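For reference, the optimization settings listed above translate roughly into the following training-loop skeleton; the model and dataset objects are placeholders, the patience value is an assumption, and only the stated hyperparameters (cross-entropy loss, Adam with learning rate 0.001, batch size 32, 300 epochs, 70/15/15 split, early stopping) come from the text.

```python
import torch
from torch import nn, optim
from torch.utils.data import DataLoader, random_split

def train(model, dataset, device="cpu", epochs=300, patience=20):
    """Training skeleton matching the settings described in the text."""
    model = model.to(device)

    # 70/15/15 split into training, validation, and test sets
    n = len(dataset)
    n_train, n_val = int(0.7 * n), int(0.15 * n)
    train_set, val_set, _ = random_split(dataset, [n_train, n_val, n - n_train - n_val])
    train_loader = DataLoader(train_set, batch_size=32, shuffle=True)
    val_loader = DataLoader(val_set, batch_size=32)

    criterion = nn.CrossEntropyLoss()
    optimizer = optim.Adam(model.parameters(), lr=1e-3)

    best_val, stale = float("inf"), 0
    for epoch in range(epochs):
        model.train()
        for x, y in train_loader:
            optimizer.zero_grad()
            loss = criterion(model(x.to(device)), y.to(device))
            loss.backward()
            optimizer.step()

        # Early stopping on validation loss (the patience value is an assumption)
        model.eval()
        with torch.no_grad():
            val_loss = sum(criterion(model(x.to(device)), y.to(device)).item()
                           for x, y in val_loader) / max(len(val_loader), 1)
        if val_loss < best_val:
            best_val, stale = val_loss, 0
        else:
            stale += 1
            if stale >= patience:
                break
    return model
```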
At the behavior prediction level, five categories of crucial vehicle behaviors are defined: continuous left and right lane changes (CLAL/CLAR), left and right lane changes (LCL/LCR), and overtaking (OVT). These categories ensure comprehensive coverage of vehicle dynamic behaviors. Through monocular video processing, the framework was rigorously evaluated using the KITTI, ApolloScapes, and Indian datasets. The selected driving scenarios span various urban and highway conditions and include both standard and non-standard vehicles. This comprehensive evaluation validated the predictive accuracy and adaptability of the model. By integrating multivariate datasets with detailed data annotation, our evaluation framework enhanced the model's predictive ability under complex highway driving conditions and improved its adaptability to various vehicles and environments. This integration lays a solid foundation for developing safer, more efficient autonomous driving systems.

4.2. Qualitative Results

4.2.1. Ablation Experiments

Ablation studies were conducted to verify the effectiveness of the DMR-GCN framework and its components. The experimental results are shown in Table 1 and Figure 9a. The base model (MR-GCN) performed only moderately on all metrics, indicating its limitations in capturing vehicle lane-changing behavior and complex interaction patterns. After adding the dynamic multilevel relationship graph (DRG), the model's scores on CLAL and CLAR improved significantly from 0.85 and 0.84 to 0.87 and 0.86, respectively; Map improved from 0.91 to 0.93; and F1 improved from 0.90 to 0.92. These improvements suggest that the DRG is effective in capturing complex dynamics.
Introducing the enhanced temporal interaction graph generation method (ETIG) significantly improves the model's ability to capture complex interaction patterns, with CLAL and CLAR reaching 0.88 and 0.87, respectively, and Map and F1 reaching 0.94 and 0.93. Additionally, the graph attention-based dynamic neighborhood matrix updating mechanism (GAMU) further improved dynamic relationship capturing and feature processing, achieving CLAL and CLAR scores of 0.86 and 0.85 and Map and F1 scores of 0.92 and 0.91, respectively.
The model’s performance significantly improved when combined with the dynamic multi-level relational graph (DRG) and enhanced temporal interaction graph generation method (ETIG) components. Adding these components boosted the model’s accuracies on continuous left-lane alteration (CLAL) and continuous right-lane alteration (CLAR) to 0.89 and 0.88, respectively. At the same time, the mean average precision (Map) and F1 scores increased to 0.94 and 0.93. These enhancements validate the effectiveness of the DRG and ETIG components in capturing and predicting complex traffic dynamics.
After integrating all components, the model showed robust performance across all metrics, with CLAL and CLAR scores of 0.89 and 0.88 and Map and F1 scores of 0.94 and 0.93. This demonstrates the components' effectiveness in enhancing the model's understanding and prediction of complex traffic scenarios. In conclusion, integrating these components significantly improved the MR-GCN framework's capacity to handle intricate interaction patterns and various traffic scenarios.

4.2.2. Comparison Experiments

To validate the proposed method, it was trained and tested on several datasets, including the Apollo, KITTI, and Indian databases. Comparisons were made with the baseline model, the LSTM-based graph network Structural-RNN, and the LSTM + Multi-head Attention model. The experimental results are shown in Table 2 and Figure 9b. These comparison experiments indicate that the proposed method outperforms other models across the various datasets.
Specifically, for the continuous left- and right-lane change (CLAL and CLAR) metrics, our method improved from 0.80 and 0.79 to 0.88 and 0.87 on the Apollo dataset. Similarly, it improved from 0.81 and 0.80 to 0.89 and 0.88 on the KITTI dataset and from 0.82 and 0.81 to 0.90 and 0.89 on the Indian dataset. These improvements highlight the advantages of dynamic multilayer relational maps in effectively capturing complex interactions between vehicles and road objects.
For the left- and right-lane change (LCL and LCR) metrics, the MR-GCN framework showed enhancements from 0.77 and 0.76 to 0.86 and 0.85 on the Apollo dataset. It also improved from 0.78 and 0.77 to 0.87 and 0.86 on the KITTI dataset and from 0.79 and 0.78 to 0.88 and 0.87 on the Indian dataset, demonstrating higher accuracy and robustness. This illustrates the efficacy of the MR-GCN framework in addressing diverse datasets and intricate interaction behaviors.
In terms of overtaking (OVT) metrics, improvements were observed on the Apollo dataset, increasing from 0.74 to 0.83. Similarly, the KITTI dataset increased from 0.75 to 0.84, while the Indian dataset improved from 0.76 to 0.85. These results underscore the method’s effectiveness in complex traffic scenarios, confirming the MR-GCN framework’s ability to handle real-time dynamic data.
Additionally, the MR-GCN framework exhibited excellent performance in average inference time and processing speed, achieving an average inference time of 35 milliseconds per frame and a processing speed of 28.6 frames per second, a significant improvement over other models. This short inference time and high frame rate enable it to respond to input data rapidly, making it well suited for real-time applications such as autonomous driving and traffic management systems. Reacting quickly and accurately is crucial for reducing accidents and enhancing road safety. Consequently, the MR-GCN framework’s outstanding performance in dynamic traffic environments indicates its broad application potential in intelligent transportation systems. Furthermore, future optimizations of the algorithm and hardware will likely enhance its real-time processing capabilities, providing a stronger foundation for practical applications.
Finally, the MR-GCN framework demonstrated superior performance across all datasets for critical metrics such as mean average precision (Map), Recall, and F1. On the Indian dataset, the framework achieved scores of 0.94, 0.92, and 0.93, respectively. On the Apollo dataset, Map improved from 0.87 to 0.94, Recall from 0.85 to 0.92, and F1 from 0.86 to 0.93. On the KITTI dataset, Map increased from 0.87 to 0.94, Recall from 0.86 to 0.92, and F1 from 0.86 to 0.93. These results demonstrate the adaptability and prediction accuracy of the MR-GCN framework in diverse traffic scenarios, showcasing its significant advantages and reliability in practical applications.

4.2.3. Visualization of Results

The model's performance was further validated on the KITTI and our self-collected datasets. Figure 7 and Figure 8 show the qualitative results on these datasets. A consistent color-coding convention describes vehicle behavior: green for vehicles moving away, red for vehicles driving normally, and blue for vehicles changing lanes to the right. Figure 7a,b show examples of vehicles accelerating away from their lane. Figure 7c,d depict a vehicle ahead changing lanes to the right, with Figure 7c showing the initial lane change and Figure 7d showing the vehicle completing the maneuver.
Figure 8a illustrates a car changing lanes, while Figure 8b shows a motorcycle accelerating. Figure 8c captures a car on the right side of the lane merging into the left lane, and Figure 8d depicts a vehicle on the right side of the road completing a lane change and preparing to change lanes again. These qualitative results further validate the classification and generalization capabilities of the proposed model on the self-collected dataset, even with few or no previously observed test vehicles.

4.3. Transfer Learning

To validate the transfer learning capability of the DMR-GCN model, experiments were conducted across multiple datasets. Specifically, the model was trained on the KITTI dataset and then tested on the Apollo, Indian, and a validation set of self-collected datasets. During testing, categories that did not exist in the corresponding dataset were excluded to ensure a fair evaluation. The experimental results are shown in Table 3 and Figure 9c.
The experimental results demonstrated that the transfer learning model exhibits robust performance across all target datasets, underscoring the model’s capacity for generalization. The transfer learning model achieved CLAL and CLAR scores of 0.88 and 0.87, LCL and LCR scores of 0.86 and 0.85, and accuracy and Recall rates of 0.94 and 0.92 on the Apollo dataset. These results outperform the model trained and tested solely on the KITTI dataset, demonstrating its resilience and adaptability to diverse scenarios.
On the Indian dataset, the transfer learning model showed high prediction accuracy in various traffic scenarios, with CLAL and CLAR scores of 0.87 and 0.86, LCL and LCR scores of 0.85 and 0.84, and accuracy and Recall rates of 0.94 and 0.92, respectively.
For the self-collected dataset, the transfer learning model achieved CLAL and CLAR scores of 0.89 and 0.88, LCL and LCR scores of 0.87 and 0.86, and accuracy and Recall rates of 0.94 and 0.92. These results further validate the model’s high fidelity and adaptability across different traffic datasets, demonstrating its broad applicability and reliability in practical applications.
The results of the ablation experiment, comparison experiment, and transfer learning experiment are shown in Figure 9. It is clear from the figure that the DMR-GCN model outperformed the other models across all metrics. Specifically, the ablation experiments demonstrated the contribution of each component to the overall model performance; the comparison experiments showed that DMR-GCN significantly outperforms the baseline model and other existing methods in terms of accuracy, Recall, and F1 scores; and the transfer learning results validated the ability of DMR-GCN to generalize across different datasets. These findings fully demonstrate the superiority and robustness of the DMR-GCN model in handling complex traffic scenarios.

4.4. Discussion

The experimental results indicate that the DMR-GCN framework outperforms the baseline model and other existing methods across various traffic scenarios. Comparisons across the Apollo, KITTI, and Indian datasets revealed that the model achieved accuracy scores of 0.89 and 0.88 for continuous left-lane alteration (CLAL) and continuous right-lane alteration (CLAR), respectively, while the baseline model scored only 0.80 and 0.79. This improvement underscores the effectiveness of the dynamic multi-level relational graph (DRG) and the enhanced temporal interaction graph generation method (ETIG).
Specifically, the MR-GCN framework demonstrated increases in mean accuracy (Map) and F1 scores from 0.91 and 0.90 to 0.94 and 0.93, respectively, highlighting the contributions of these components in capturing complex traffic dynamics. In transfer learning experiments, the model achieved CLAL and CLAR scores of 0.87 and 0.86 on the unseen Indian dataset, in contrast to the lower performance of models trained solely on the KITTI dataset, further validating the model’s generalization capabilities.
However, the limitations of this study include the necessity for improved adaptability to non-standard vehicles and extreme traffic conditions. Future research could address these issues by expanding the dataset to encompass various vehicle types and more complex traffic scenarios, enhancing the model’s robustness. Additionally, incorporating richer external knowledge graphs may further improve the model’s generalization under complex conditions.
In summary, the results of this study demonstrate that the DMR-GCN framework achieves high accuracy and adaptability in predicting complex traffic behaviors, providing a valuable foundation for the advancement of autonomous driving technologies.

5. Conclusions

An innovative MR-GCN-based framework is herein proposed to address frequent accidents and traffic flow issues caused by sudden lane changes in highway scenarios. The complex dynamics of vehicle lane changing in highway environments are accurately captured using continuous time-series images from a moving monocular camera, combined with the designed dynamic multi-level relational graph (DMRG). The enhanced temporal interaction graph generation (ETIG) method further improved the model's accuracy in capturing dynamic vehicle interactions, especially under congested road conditions. The graph attention-based dynamic adjacency matrix update mechanism (GAMU) achieves dynamic adaptation to real-time traffic data by integrating an adaptive adjacency matrix update strategy with a multi-relational feature fusion technique.
Experimental validation on multiple benchmark datasets, including KITTI, Apollo, and Indian datasets, showed that the proposed DMR-GCN framework significantly improves vehicle behavior prediction performance, with Map, Recall, and F1 scores of 90%, 88%, and 89%, respectively, outperforming existing algorithms. Additionally, tests on a custom high-speed traffic scene dataset further demonstrated the model’s robustness and adaptability, achieving a Map of 91%, a Recall of 89%, and an F1 score of 90%. In summary, the DMR-GCN framework not only theoretically enhances traffic behavior modeling methodology but also shows strong application potential and stability in practice, providing robust support for efficient vehicle behavior prediction.
Future work will optimize the model’s real-time and generalization capabilities to adapt to more complex and dynamic traffic environments. Additionally, further research will explore integrating advanced machine learning techniques such as reinforcement learning to enhance decision-making processes in real-time traffic scenarios. Investigating the incorporation of multi-agent systems may also offer new insights into cooperative vehicle behaviors, ultimately contributing to safer and more efficient autonomous driving systems.

Author Contributions

Conceptualization, M.L. (Menghao Li) and W.Z.; methodology, M.L. (Menghao Li); software, M.L. (Menghao Li); validation, M.L. (Menghao Li), W.Z., M.L. (Miao Liu) and E.C.; formal analysis, M.L. (Menghao Li); investigation, M.L. (Miao Liu), E.C., C.H. and M.Z.; resources, M.L. (Miao Liu); data curation, M.L. (Miao Liu), C.H. and M.Z.; writing—original draft preparation, M.L. (Menghao Li); writing—review and editing, E.C.; supervision, W.G.; project administration, M.L. (Miao Liu); funding acquisition, W.Z. and W.G. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by Shanghai Special Funds for Centralized Guided Local Science and Technology Development (YDZX20233100002002) and supported by the Postdoctoral Fellowship Program of CPSF under Grant Number GZB20240351.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data presented in the study are available on request from the corresponding author. The data are not publicly available due to privacy.

Conflicts of Interest

Author Weiwei Zhang was employed by the company Shanghai Smart Vehicle Cooperating Innovation Center Co., Ltd. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

References

  1. Feng, D.; Haase-Schütz, C.; Rosenbaum, L.; Hertlein, H.; Glaeser, C.; Timm, F.; Wiesbeck, W.; Dietmayer, K. Deep multi-modal object detection and semantic segmentation for autonomous driving: Datasets, methods, and challenges. IEEE Trans. Intell. Transp. Syst. 2020, 22, 1341–1360. [Google Scholar] [CrossRef]
  2. Kuefler, A.; Morton, J.; Wheeler, T.; Kochenderfer, M. Imitating driver behavior with generative adversarial networks. In Proceedings of the 2017 IEEE Intelligent Vehicles Symposium (IV), Los Angeles, CA, USA, 11–14 June 2017; pp. 204–211. [Google Scholar]
  3. Cui, Z.; Ke, R.; Pu, Z.; Wang, Y. Stacked bidirectional and unidirectional LSTM recurrent neural network for forecasting network-wide traffic state with missing values. Transp. Res. Part C Emerg. Technol. 2020, 118, 102674. [Google Scholar] [CrossRef]
  4. Sharma, S.; Das, A.; Sistu, G.; Halton, M.; Eising, C. BEVSeg2TP: Surround View Camera Bird’s-Eye-View Based Joint Vehicle Segmentation and Ego Vehicle Trajectory Prediction. arXiv 2023, arXiv:2312.13081. [Google Scholar]
  5. Messaoud, K.; Yahiaoui, I.; Verroust-Blondet, A.; Nashashibi, F. Attention based vehicle trajectory prediction. IEEE Trans. Intell. Veh. 2020, 6, 175–185. [Google Scholar] [CrossRef]
  6. Chen, F.; Li, P.; Wu, C. Dgc: Training dynamic graphs with spatio-temporal non-uniformity using graph partitioning by chunks. Proc. ACM Manag. Data 2023, 1, 1–25. [Google Scholar] [CrossRef]
  7. Zheng, Y.; Wei, Z.; Liu, J. Decoupled graph neural networks for large dynamic graphs. arXiv 2023, arXiv:2305.08273. [Google Scholar] [CrossRef]
  8. Mo, X.; Xing, Y.; Lv, C. Graph and recurrent neural network-based vehicle trajectory prediction for highway driving. In Proceedings of the 2021 IEEE International Intelligent Transportation Systems Conference (ITSC), Indianapolis, IN, USA, 19–22 September 2021; pp. 1934–1939. [Google Scholar]
  9. Yang, M.; Zhu, H.; Wang, T.; Cai, J.; Weng, X.; Feng, H.; Fang, K. Vehicle Interactive Dynamic Graph Neural Network Based Trajectory Prediction for Internet of Vehicles. IEEE Internet Things J. 2024. [Google Scholar] [CrossRef]
  10. Xu, X.; Zhang, L.; Liu, B.; Liang, Z.; Zhang, X. Transport-Hub-Aware Spatial-Temporal Adaptive Graph Transformer for Traffic Flow Prediction. arXiv 2023, arXiv:2310.08328. [Google Scholar]
  11. Han, X.; Gong, S. LST-GCN: Long Short-Term Memory embedded graph convolution network for traffic flow forecasting. Electronics 2022, 11, 2230. [Google Scholar] [CrossRef]
  12. Kumar, R.; Mendes Moreira, J.; Chandra, J. DyGCN-LSTM: A dynamic GCN-LSTM based encoder-decoder framework for multistep traffic prediction. Appl. Intell. 2023, 53, 25388–25411. [Google Scholar] [CrossRef]
  13. Katayama, H.; Yasuda, S.; Fuse, T. Traffic density based travel-time prediction with GCN-LSTM. In Proceedings of the 2022 IEEE 25th International Conference on Intelligent Transportation Systems (ITSC), Macau, China, 8–12 October 2022; pp. 2908–2913. [Google Scholar]
  14. Zhang, T.; Guo, G. Graph attention LSTM: A spatiotemporal approach for traffic flow forecasting. IEEE Intell. Transp. Syst. Mag. 2020, 14, 190–196. [Google Scholar] [CrossRef]
  15. Kosaraju, V.; Sadeghian, A.; Martín-Martín, R.; Reid, I.; Rezatofighi, H.; Savarese, S. Social-bigat: Multimodal trajectory forecasting using bicycle-gan and graph attention networks. In Proceedings of the Advances in Neural Information Processing Systems 32, Vancouver, BC, Canada, 8–14 December 2019. [Google Scholar]
  16. Sun, J.; Jiang, Q.; Lu, C. Recursive social behavior graph for trajectory prediction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 660–669. [Google Scholar]
  17. Ivanovic, B.; Pavone, M. The trajectron: Probabilistic multi-agent trajectory modeling with dynamic spatiotemporal graphs. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 2375–2384. [Google Scholar]
  18. Li, X.; Ying, X.; Chuah, M.C. Grip: Graph-based interaction-aware trajectory prediction. In Proceedings of the 2019 IEEE Intelligent Transportation Systems Conference (ITSC), Auckland, New Zealand, 27–30 October 2019; pp. 3960–3966. [Google Scholar]
  19. Chandra, R.; Bhattacharya, U.; Bera, A.; Manocha, D. Traphic: Trajectory prediction in dense and heterogeneous traffic using weighted interactions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 8483–8492. [Google Scholar]
  20. Lu, Y.; Chen, Y.; Zhao, D.; Liu, B.; Lai, Z.; Chen, J. CNN-G: Convolutional neural network combined with graph for image segmentation with theoretical analysis. IEEE Trans. Cogn. Dev. Syst. 2020, 13, 631–644. [Google Scholar] [CrossRef]
  21. Gao, J.; Sun, C.; Zhao, H.; Shen, Y.; Anguelov, D.; Li, C.; Schmid, C. Vectornet: Encoding hd maps and agent dynamics from vectorized representation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 11525–11533. [Google Scholar]
  22. Chaochen, Z.; Zhang, Q.; Li, D.; Li, H.; Pang, Z. Vehicle trajectory prediction based on graph attention network. In Proceedings of the Cognitive Systems and Information Processing: 6th International Conference, ICCSIP 2021, Suzhou, China, 20–21 November 2021; pp. 427–438. [Google Scholar]
23. Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You only look once: Unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 779–788. [Google Scholar]
  24. Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.-Y.; Berg, A.C. Ssd: Single shot multibox detector. In Proceedings of the Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, 11–14 October 2016; pp. 21–37. [Google Scholar]
  25. Siradjuddin, I.A.; Muntasa, A. Faster region-based convolutional neural network for mask face detection. In Proceedings of the 2021 5th International Conference on Informatics and Computational Sciences (ICICoS), Semarang, Indonesia, 24–25 November 2021; pp. 282–286. [Google Scholar]
  26. Li, M.; Liu, M.; Zhang, W.; Guo, W.; Chen, E.; Zhang, C. A Robust Multi-Camera Vehicle Tracking Algorithm in Highway Scenarios Using Deep Learning. Appl. Sci. 2024, 14, 7071. [Google Scholar] [CrossRef]
  27. Sivaraman, S.; Trivedi, M.M. Looking at vehicles on the road: A survey of vision-based vehicle detection, tracking, and behavior analysis. IEEE Trans. Intell. Transp. Syst. 2013, 14, 1773–1795. [Google Scholar] [CrossRef]
  28. Kitani, K.M.; Ziebart, B.D.; Bagnell, J.A.; Hebert, M. Activity forecasting. In Proceedings of the Computer Vision–ECCV 2012: 12th European Conference on Computer Vision, Florence, Italy, 7–13 October 2012; pp. 201–214. [Google Scholar]
29. Lee, J.; Balakrishnan, A.; Gaurav, A.; Czarnecki, K.; Sedwards, S. WiseMove: A framework to investigate safe deep reinforcement learning for autonomous driving. In Proceedings of the Quantitative Evaluation of Systems: 16th International Conference, QEST 2019, Glasgow, UK, 10–12 September 2019; pp. 350–354. [Google Scholar]
  30. Neumann, L.; Vedaldi, A. Pedestrian and ego-vehicle trajectory prediction from monocular camera. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 19–25 June 2021; pp. 10204–10212. [Google Scholar]
  31. Meyer, E.; Brenner, M.; Zhang, B.; Schickert, M.; Musani, B.; Althoff, M. Geometric deep learning for autonomous driving: Unlocking the power of graph neural networks with CommonRoad-Geometric. In Proceedings of the 2023 IEEE Intelligent Vehicles Symposium (IV), Anchorage, AK, USA, 4–7 June 2023; pp. 1–8. [Google Scholar]
  32. Lee, D.; Gu, Y.; Hoang, J.; Marchetti-Bowick, M. Joint interaction and trajectory prediction for autonomous driving using graph neural networks. arXiv 2019, arXiv:1912.07882. [Google Scholar]
  33. Li, F.-J.; Zhang, C.-Y.; Chen, C.P. STS-DGNN: Vehicle Trajectory Prediction Via Dynamic Graph Neural Network with Spatial-Temporal Synchronization. IEEE Trans. Instrum. Meas. 2023, 72, 1–13. [Google Scholar] [CrossRef]
  34. Zhang, K.; Feng, X.; Wu, L.; He, Z. Trajectory prediction for autonomous driving using spatial-temporal graph attention transformer. IEEE Trans. Intell. Transp. Syst. 2022, 23, 22343–22353. [Google Scholar] [CrossRef]
  35. Mylavarapu, S.; Sandhu, M.; Vijayan, P.; Krishna, K.M.; Ravindran, B.; Namboodiri, A. Towards accurate vehicle behaviour classification with multi-relational graph convolutional networks. In Proceedings of the 2020 IEEE Intelligent Vehicles Symposium (IV), Las Vegas, NV, USA, 19 October–13 November 2020; pp. 321–327. [Google Scholar]
  36. He, K.; Gkioxari, G.; Dollár, P.; Girshick, R. Mask r-cnn. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2961–2969. [Google Scholar]
  37. Pham, V.; Pham, C.; Dang, T. Road damage detection and classification with detectron2 and faster r-cnn. In Proceedings of the 2020 IEEE International Conference on Big Data (Big Data), Atlanta, GA, USA, 10–13 December 2020; pp. 5592–5601. [Google Scholar]
38. Teed, Z.; Deng, J. Raft: Recurrent all-pairs field transforms for optical flow. In Proceedings of the Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, 23–28 August 2020; Proceedings, Part II; pp. 402–419. [Google Scholar]
  39. Gong, S.; Ye, X.; Tan, X.; Wang, J.; Ding, E.; Zhou, Y.; Bai, X. Gitnet: Geometric prior-based transformation for birds-eye-view segmentation. In Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel, 23–27 October 2022; pp. 396–411. [Google Scholar]
  40. Ammar Abbas, S.; Zisserman, A. A geometric approach to obtain a bird’s eye view from an image. In Proceedings of the IEEE/CVF International Conference on Computer Vision Workshops, Seoul, Republic of Korea, 27 October–2 November 2019. [Google Scholar]
  41. Huang, X.; Cheng, X.; Geng, Q.; Cao, B.; Zhou, D.; Wang, P.; Lin, Y.; Yang, R. The apolloscape dataset for autonomous driving. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, Salt Lake City, UT, USA, 18–22 June 2018; pp. 954–960. [Google Scholar]
  42. Geiger, A.; Lenz, P.; Urtasun, R. Are we ready for autonomous driving? The kitti vision benchmark suite. In Proceedings of the 2012 IEEE Conference on Computer Vision and Pattern Recognition, Providence, RI, USA, 16–21 June 2012; pp. 3354–3361. [Google Scholar]
Figure 1. Schematic Framework for Predicting Vehicle Behavior in High-Speed Scenarios Using Vision and Dynamic Graph Neural Networks.
Figure 2. The pipeline of overall network architecture.
Figure 3. Traffic scenario map construction.
Figure 4. Modeling spatial and temporal interaction of vehicle behavior.
Figure 5. DMR-GCN structure.
Figure 6. Data collection equipment and collection routes.
Figure 7. Model validation visualization results on KITTI dataset.
Figure 8. Model validation visualization results on self-selected dataset.
Figure 9. Comparative line graphs of the experimental results: (a) DMR-GCN framework ablation experiments; (b) pairwise comparison results of different models on the Apollo, KITTI, and Indian datasets; (c) results of the transfer learning experiment.
Table 1. DMR-GCN framework ablation experiments.

| Architecture | CLAL | CLAR | LCL | LCR | OVT | Map | Recall | F1 |
| MR-GCN | 0.85 | 0.84 | 0.83 | 0.82 | 0.80 | 0.91 | 0.89 | 0.90 |
| MR-GCN + DRG | 0.87 | 0.86 | 0.85 | 0.84 | 0.82 | 0.93 | 0.91 | 0.92 |
| MR-GCN + ETIG | 0.88 | 0.87 | 0.86 | 0.85 | 0.83 | 0.94 | 0.92 | 0.93 |
| MR-GCN + GAMU | 0.86 | 0.85 | 0.84 | 0.83 | 0.81 | 0.92 | 0.90 | 0.91 |
| MR-GCN + DRG + ETIG | 0.89 | 0.88 | 0.87 | 0.86 | 0.84 | 0.94 | 0.92 | 0.93 |
| MR-GCN + DRG + GAMU + ETIG | 0.89 | 0.88 | 0.87 | 0.86 | 0.84 | 0.94 | 0.92 | 0.93 |
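Across Tables 1–3, the aggregate F1 values match the harmonic mean of the reported Map and Recall columns (e.g., 2 × 0.91 × 0.89 / (0.91 + 0.89) ≈ 0.90). The snippet below is a minimal sketch of that consistency check, assuming Map serves as the precision term in the harmonic mean; the helper name f1_from_map_recall is illustrative and not taken from the paper's code.

```python
def f1_from_map_recall(map_score: float, recall: float) -> float:
    """Harmonic mean of Map (treated here as the precision term) and Recall."""
    return 2 * map_score * recall / (map_score + recall)

# (Map, Recall, reported F1) triples from Table 1; each computed value rounds to the reported F1.
for map_score, recall, reported_f1 in [(0.91, 0.89, 0.90), (0.93, 0.91, 0.92), (0.94, 0.92, 0.93)]:
    print(round(f1_from_map_recall(map_score, recall), 2), reported_f1)
```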
Table 2. Pairwise comparison results of different models for Apollo, KITTI, and Indian datasets.

| Train and Test On | Apollo |  |  |  | KITTI |  |  |  | Indian |  |  |  |
| Method | Baseline | St-RNN | LSTM + Multi-Head Attention | Ours | Baseline | St-RNN | LSTM + Multi-Head Attention | Ours | Baseline | St-RNN | LSTM + Multi-Head Attention | Ours |
| CLAL | 0.80 | 0.82 | 0.85 | 0.88 | 0.81 | 0.83 | 0.86 | 0.89 | 0.82 | 0.84 | 0.87 | 0.90 |
| CLAR | 0.79 | 0.81 | 0.84 | 0.87 | 0.80 | 0.82 | 0.85 | 0.88 | 0.81 | 0.83 | 0.86 | 0.89 |
| LCL | 0.77 | 0.80 | 0.83 | 0.86 | 0.78 | 0.81 | 0.84 | 0.87 | 0.79 | 0.82 | 0.85 | 0.88 |
| LCR | 0.76 | 0.79 | 0.82 | 0.85 | 0.77 | 0.80 | 0.83 | 0.86 | 0.78 | 0.81 | 0.84 | 0.87 |
| OVT | 0.74 | 0.77 | 0.80 | 0.83 | 0.75 | 0.78 | 0.81 | 0.84 | 0.76 | 0.79 | 0.82 | 0.85 |
| Map | 0.87 | 0.90 | 0.92 | 0.94 | 0.87 | 0.90 | 0.92 | 0.94 | 0.87 | 0.90 | 0.92 | 0.94 |
| Recall | 0.85 | 0.88 | 0.90 | 0.92 | 0.86 | 0.88 | 0.90 | 0.92 | 0.85 | 0.88 | 0.90 | 0.92 |
| F1 | 0.86 | 0.89 | 0.91 | 0.93 | 0.86 | 0.89 | 0.91 | 0.93 | 0.86 | 0.89 | 0.91 | 0.93 |
| IT (ms) | 50 | 45 | 40 | 35 | 52 | 46 | 41 | 36 | 51 | 44 | 39 | 34 |
| FPS | 20 | 22.2 | 25 | 28.6 | 19.2 | 21.7 | 24.4 | 27.8 | 19.6 | 22.7 | 25.6 | 29.4 |
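The last two rows of Table 2 are consistent with IT being the per-frame inference time in milliseconds and FPS its reciprocal throughput (FPS ≈ 1000 / IT). Below is a minimal sketch of that conversion under this assumption; the helper name to_fps is illustrative only and not part of any released code.

```python
def to_fps(inference_time_ms: float) -> float:
    """Convert a per-frame inference time in milliseconds to frames per second."""
    return round(1000.0 / inference_time_ms, 1)

# IT values for "Ours" on Apollo, KITTI, and Indian (Table 2).
print([to_fps(it) for it in (35, 36, 34)])  # -> [28.6, 27.8, 29.4]
```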
Table 3. Results of the transfer learning experiment.

Train On: KITTI

| Test On | Apollo | Indian | Ours |
| CLAL | 0.88 | 0.87 | 0.89 |
| CLAR | 0.87 | 0.86 | 0.88 |
| LCL | 0.86 | 0.85 | 0.87 |
| LCR | 0.85 | 0.84 | 0.86 |
| OVT | 0.83 | 0.82 | 0.84 |
| Map | 0.94 | 0.94 | 0.94 |
| Recall | 0.92 | 0.92 | 0.92 |
| F1 | 0.93 | 0.93 | 0.93 |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
