1. Introduction
Research on road safety indicates that the majority of traffic accidents stem from improper driver maneuvers. Despite achieving satisfactory performance in specific scenarios, fully autonomous driving remains a long-term objective due to the ongoing necessity for comprehensive legislation, regulations, and infrastructure development [1]. Consequently, human–machine cooperative driving continues to be a crucial research direction in intelligent transportation systems. Understanding the driver's intent is an essential prerequisite for effective human–machine interaction: it enables autonomous vehicle decision-making that aligns with drivers' preferences in uncertain environments and allows drivers to be alerted during hazardous situations [2]. However, predicting human intent is challenging because of factors affecting human drivers, such as distraction, emotional state, and lapses in concentration, all of which can lead to road hazards.
In recent years, research has focused on exploring new technologies for comprehensive perception in intelligent vehicles, making it possible to predict driver intent over time. Jain et al. [3] introduced the Autoregressive Input–Output HMM (AIO-HMM) model, which processes both internal (2D facial features) and external features, predicting potential driver actions seconds before driving maneuvers. They also provided the Brain4Cars dataset, consisting of 1180 miles of natural highway and city driving, for method evaluation.
The dataset's video frames contain dynamic information on driver maneuver patterns and road traffic conditions. Gebert [4] and Xing [5], among others, conducted statistical analyses of different expressions of driving intent, finding a high correlation between driving intent and driver maneuvers. When drivers are about to change their maneuver, they exhibit corresponding actions, such as head posture changes [3,6,7,8], specific preparatory maneuvers, and eye movements while checking the rearview mirror [9,10], providing crucial evidence for intent inference.
This work is significant, as it lays the foundation for driver assistance systems and proposes a method for predicting driver actions using dynamic visual data inside and outside the vehicle. It addresses the challenge of predicting driver operations seconds in advance, allowing for timely warnings to drivers and contributing to the development of next-generation advanced driver-assistance systems (ADASs), reducing road hazard risks.
Several researchers have proposed improvements to this pipeline. Jain et al. [11] suggested a deep learning architecture based on recurrent neural networks with LSTM units (RNN-LSTM), which upgrades the internal driver features from 2D to 3D facial features and enhances the accuracy of driver maneuver prediction by fusing information from multiple sensors. Moussaid et al. [12] presented a method using driver facial information to predict lane-changing actions, implementing a CNN-LSTM model for analyzing driver actions before lane changes. Tonutti et al. [8] introduced a method based on Domain-Adversarial Recurrent Neural Networks (DA-RNNs), improving the generalization capability of maneuver prediction. Gebert et al. [4] combined a 3D-ResNet with an LSTM to predict driver intent by analyzing driver motion and external vehicle video data. Rong et al. [13] proposed a driver intent prediction method based on monitoring internal and external scenes, achieving better prediction performance with fewer parameters.
Gebert et al. [4] and Rong et al. [13] both use two branches for internal and external video processing, but with a distinct difference: Gebert et al. [4] compute optical flow from internal videos, whereas Rong et al. [13] calculate it from external videos. These studies differ from previous approaches, which use numerical data (e.g., lane numbers, speed) as external features; instead, they extract external features directly from external videos using CNN models.
LSTMs have long played a crucial role in driver maneuver recognition owing to their ability to model temporal dependencies. In practice, however, they struggle to capture very long-range dependencies and present further challenges, such as high computational complexity, susceptibility to video noise, and limited interpretability. Recent studies therefore favor 3D-CNN models for spatiotemporal feature extraction, addressing these limitations. Even so, learning effective spatiotemporal representations remains difficult because of local redundancy and global dependency issues.
A combination of 3D convolutional neural networks (3D-CNNs) and spatiotemporal transformers has emerged as a promising solution for better driver intent inference. However, both have limitations. While 3D-CNNs reduce spatiotemporal redundancy, their finite receptive fields make learning long-term dependencies difficult. Spatiotemporal transformers excel at capturing global dependencies but introduce redundancy in shallow layers when encoding local spatiotemporal features.
Additionally, two challenges affect driver intent inference accuracy. First, inadequate utilization of external video information limits the perception and understanding of the surrounding environment. Rong et al. [13] demonstrated that external videos complement internal driver videos and provide essential information; this external information is needed to avoid misidentification in challenging situations. Second, the training data are imbalanced: straight driving maneuvers are far more common than turns and lane changes, which complicates model training. This imbalance may cause the model to favor dominant classes, reducing accuracy on minority classes.
Inspired by transformer models, we propose the Spatial–Temporal Joint Attention Network (STA-Net), combining CNNs and transformers in a dual-stream framework. The contributions of this study can be summarized as follows:
- (1) We propose a two-stream network to extract in-cabin driver behavior and out-of-cabin environmental information, addressing spatiotemporal redundancy and the insufficient use of driving scene information.
- (2) We employ joint learning of a CNN and a transformer to fuse spatiotemporal information at different levels: the CNN focuses on low-level local features to reduce redundancy, while the transformer captures high-level global information to address long-term dependencies.
- (3) We introduce an asymmetric loss function to tackle the problem of imbalanced training data, reducing the negative impact of sample imbalance on model optimization.
3. Methods
The spatiotemporal joint reasoning of driving intention can be cast as a sequential image classification problem. In this work, we propose a novel driving intention inference framework, the Spatial–Temporal Joint Attention Network (STA-Net), which simultaneously utilizes two input sources, internal and external videos, as shown in Figure 1. One branch learns spatial semantic features from traffic videos, while the other learns spatial semantic features from driver videos, thereby addressing the deficiency of relying solely on in-vehicle driver spatiotemporal features for driving intention inference. As illustrated in Figure 1, the backbone is a parallel dual-branch network, mainly composed of the Spatial–Temporal Joint Attention Block (STA Block) and the Cross-Spatial Attention Module (CSAM), where the STA Block consists of Multi-Scale Transposed Attention (MSTA) and a Multi-Scale Feedforward Network (MSFN). The STA Block adopts a joint CNN-and-transformer approach to simultaneously extract driver maneuver features and spatiotemporal features of the traffic scene; MSTA and MSFN alleviate the insufficient receptive field at different levels and enrich the spatiotemporal feature information. At each stage, features extracted by the STA Block at the same scale are aggregated through CSAM, and the Multi-CSAM Fusion Module (MCFM) aggregates the in-vehicle driver features and traffic scene features from different stages and scales for driving intention inference.
More specifically, we hierarchically stack STA Block units to construct our network for spatiotemporal learning. As shown in Figure 1, our network comprises four stages with 64, 128, 256, and 512 channels, respectively, and the backbone of the STA framework uses [5, 7, 8, 20] STA Block units in the four stages. We employ MSTA (Equation (1)) in each STA Block to reduce spatiotemporal redundancy and normalize the data using LN [29]. Before the first stage, we apply a 3 × 4 × 4 convolution with a stride of 2 × 4 × 4, downsampling both the spatial and temporal dimensions; before each of the other stages, we use a 1 × 1 × 2 convolution with a stride of 1 × 1 × 2. Finally, spatiotemporal average pooling and fully connected layers produce the ultimate prediction. In this way, our STA-Net addresses video redundancy and dependencies within a unified framework. Each module is detailed as follows.
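To make the stage layout concrete, the following PyTorch sketch instantiates the four-stage configuration described above (channel widths 64/128/256/512, block counts [5, 7, 8, 20], and the stated downsampling convolutions). The `STABlockStub` is a placeholder stand-in rather than the full MSTA/MSFN block, and all module names are illustrative, not taken from any released code.

```python
import torch
import torch.nn as nn

class STABlockStub(nn.Module):
    """Placeholder for the STA Block (MSTA + MSFN) to keep the sketch short."""
    def __init__(self, channels: int):
        super().__init__()
        self.body = nn.Conv3d(channels, channels, kernel_size=3, padding=1)

    def forward(self, x):
        return x + self.body(x)

class STABackboneSketch(nn.Module):
    """Four stages: channels (64, 128, 256, 512), block counts (5, 7, 8, 20)."""
    def __init__(self, in_channels: int = 3, num_classes: int = 5):
        super().__init__()
        channels, depths = [64, 128, 256, 512], [5, 7, 8, 20]
        stages, c_prev = [], in_channels
        for i, (c, d) in enumerate(zip(channels, depths)):
            if i == 0:
                # 3x4x4 conv, stride 2x4x4: downsamples time and space before stage 1
                stem = nn.Conv3d(c_prev, c, (3, 4, 4), stride=(2, 4, 4), padding=(1, 0, 0))
            else:
                # 1x1x2 conv, stride 1x1x2 before the remaining stages, as stated above
                stem = nn.Conv3d(c_prev, c, (1, 1, 2), stride=(1, 1, 2))
            stages.append(nn.Sequential(stem, *[STABlockStub(c) for _ in range(d)]))
            c_prev = c
        self.stages = nn.ModuleList(stages)
        self.pool = nn.AdaptiveAvgPool3d(1)                # spatiotemporal average pooling
        self.head = nn.Linear(channels[-1], num_classes)   # final fully connected layer

    def forward(self, x):                                  # x: (B, 3, T, H, W)
        for stage in self.stages:
            x = stage(x)
        return self.head(self.pool(x).flatten(1))

# e.g., logits = STABackboneSketch()(torch.randn(1, 3, 16, 224, 224))
```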
3.1. Framework for Spatiotemporal Feature Extraction Based on Dual-Stream Networks
As previously mentioned, the driving intent inference framework STA-Net simultaneously handles video data from inside and outside the vehicle. One branch learns spatial semantic information from traffic videos, while the other learns from driver videos. The specific structures are described below.
3.1.1. STA-Block
To overcome spatiotemporal redundancy and dependency issues, we propose a novel module called the Spatial–Temporal Joint Attention Block (STA-Block), as illustrated in Figure 1. We leverage the fundamental transformer architecture [30] and tailor it specifically for efficient and effective spatiotemporal representation learning. Specifically, the STA-Block comprises two key modules: Multi-Scale Transposed Attention (MSTA) and the Multi-Scale Feedforward Network (MSFN). MSTA addresses local video redundancy and global video dependencies by extracting features at different scales in both shallow and deep layers. Finally, we introduce a feedforward network (FFN) with two linear layers to enhance each token pointwise.
As mentioned above, we aim to address two main challenges for efficient and effective spatiotemporal representation learning: significant local redundancy and intricate global dependencies. Existing methods, such as popular 3D-CNNs and spatiotemporal transformers, often focus on only one of these challenges. Therefore, we introduce Multi-Scale Transposed Attention (MSTA). Designed in a concise transformer format, MSTA seamlessly unifies 3D convolution and spatiotemporal self-attention, tackling video redundancy and dependencies at different levels in both shallow and deep layers.
Because the self-attention layer dominates the computational overhead of transformers, applying the traditional self-attention mechanism (SA) [30,31] is impractical for most video-understanding tasks: the time and memory complexity of key–query dot-product interactions grows quadratically with the spatial resolution of the input. To address this issue, we propose Multi-Scale Transposed Attention (MSTA), which exhibits linear complexity, as depicted in Figure 2. The key distinction is that MSTA applies self-attention across channels, computing the cross-covariance across channels to generate an attention map that implicitly encodes global context. As another integral component of MSTA, we introduce depthwise convolution to emphasize the 3D local context, performing this operation before computing the feature covariance that produces the global attention map.
From the layer-normalized tensor $\mathbf{Y}$, our Multi-Scale Transposed Attention (MSTA) initially generates query ($\mathbf{Q}$), key ($\mathbf{K}$), and value ($\mathbf{V}$) projections, enriching local contexts. This is achieved by applying a 1 × 1 × 1 convolution to aggregate spatiotemporal cross-channel context, followed by a 3 × 3 × 3 depthwise convolution to encode channel-level spatiotemporal context, resulting in $\mathbf{Q} = W_d^Q W_p^Q \mathbf{Y}$, $\mathbf{K} = W_d^K W_p^K \mathbf{Y}$, and $\mathbf{V} = W_d^V W_p^V \mathbf{Y}$. Here, $W_p^{(\cdot)}$ represents a 1 × 1 × 1 pointwise convolution, and $W_d^{(\cdot)}$ represents a 3 × 3 × 3 depthwise convolution. In summary, the MSTA process is defined as follows:

$$\hat{\mathbf{X}} = W_p\,\mathrm{Attention}(\mathbf{Q}, \mathbf{K}, \mathbf{V}) + \mathbf{X}, \qquad \mathrm{Attention}(\mathbf{Q}, \mathbf{K}, \mathbf{V}) = \mathbf{V} \cdot \mathrm{Softmax}\!\left(\mathbf{K} \cdot \mathbf{Q}^{\top} / \alpha\right) \quad (1)$$

Here, $\mathbf{X}$ and $\hat{\mathbf{X}}$ are the input and output feature maps, respectively. In this context, α is a learnable scaling parameter used to control the magnitude of the dot product between $\mathbf{K}$ and $\mathbf{Q}$ before applying the SoftMax function. Similar to traditional multi-head self-attention [31], we divide the number of channels into "heads" and independently learn attention maps in parallel.
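As an illustration of Equation (1), the following is a minimal PyTorch sketch of transposed (channel-wise) attention with 1 × 1 × 1 pointwise and 3 × 3 × 3 depthwise projections, assuming the residual connection is applied by the enclosing block; the head count and module names are assumptions for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MSTA(nn.Module):
    """Transposed attention sketch: attention is computed across channels,
    so complexity is linear in the number of spatiotemporal tokens."""
    def __init__(self, channels: int, num_heads: int = 8):
        super().__init__()
        self.num_heads = num_heads
        self.alpha = nn.Parameter(torch.ones(num_heads, 1, 1))  # learnable scaling
        # 1x1x1 pointwise conv aggregates cross-channel context; the 3x3x3
        # depthwise conv then encodes channel-level local spatiotemporal context
        self.qkv_point = nn.Conv3d(channels, channels * 3, kernel_size=1)
        self.qkv_depth = nn.Conv3d(channels * 3, channels * 3, kernel_size=3,
                                   padding=1, groups=channels * 3)
        self.project_out = nn.Conv3d(channels, channels, kernel_size=1)  # W_p

    def forward(self, x):                              # x: (B, C, T, H, W)
        b, c, t, h, w = x.shape
        q, k, v = self.qkv_depth(self.qkv_point(x)).chunk(3, dim=1)
        # reshape so each head attends over a (C/heads) x (C/heads) covariance map
        shape = (b, self.num_heads, c // self.num_heads, t * h * w)
        q, k, v = q.reshape(shape), k.reshape(shape), v.reshape(shape)
        q, k = F.normalize(q, dim=-1), F.normalize(k, dim=-1)
        attn = (k @ q.transpose(-2, -1)) * self.alpha  # cross-covariance across channels
        out = attn.softmax(dim=-1) @ v                 # (B, heads, C/heads, T*H*W)
        return self.project_out(out.reshape(b, c, t, h, w))
```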
b. Multi-Scale Feedforward Network
A conventional feedforward network (FN) [30,31] performs the same operation at each spatiotemporal position to transform features. It utilizes two 1 × 1 × 1 convolutions: the first expands the feature channels (typically by a factor of γ = 4), and the second reduces the channel count back to the original input dimension.
The first convolution projects the features into a higher-dimensional space, where advanced spatiotemporal characteristics can be captured more effectively, while the second maps these advanced features back to the original input dimension so that they can be fused with the outputs of other layers. Although the two convolutional layers serve different goals, their input and output lie in the same space, namely the dimensions of the original input, so the two outputs can be combined by element-wise operations in that space. This design allows the network to operate in an expanded feature space and merge back into the original space when necessary.
In this work, we make two fundamental modifications to the feedforward network (FN) to enhance representation learning: (1) introducing a gating mechanism and (2) adopting depthwise convolutions. The structure of the resulting Multi-Scale Feedforward Network (MSFN) is shown in Figure 3.
The gating mechanism helps regulate the flow of information through the network hierarchy, enabling each layer to focus on finer image attributes. It is realized as the element-wise product of two parallel linear-transformation paths, one of which passes through a GELU non-linearity [32]. Similar to MSTA, we also introduce depthwise convolutions in the MSFN to encode information from spatially adjacent positions, allowing the model to better capture channel-specific information; this improves its ability to distinguish between different features and to learn more discriminative representations. Given an input tensor $\mathbf{X}$, the MSFN is formulated as:

$$\hat{\mathbf{X}} = W_p^0\left[\phi\!\left(W_d^1 W_p^1\,\mathrm{LN}(\mathbf{X})\right) \odot W_d^2 W_p^2\,\mathrm{LN}(\mathbf{X})\right] + \mathbf{X} \quad (2)$$

where $\odot$ represents element-wise multiplication, $\phi$ represents the GELU non-linearity, and LN is layer normalization [29]. Overall, the MSFN controls the flow of information at each hierarchical level in our pipeline, allowing each level to focus on fine details complementary to the other levels. In other words, compared with MSTA, the MSFN plays a different role, focused on enriching features with contextual information. In summary, the MSFN further blends the token context at each spatiotemporal position to improve classification accuracy.
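The gated feedforward design can be sketched in PyTorch as follows; this is a minimal reading of Equation (2), assuming the expansion factor γ = 4 mentioned above and layer normalization applied by the enclosing block.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MSFN(nn.Module):
    """Gated feedforward sketch: two parallel 1x1x1 + 3x3x3 depthwise paths,
    one passed through GELU, combined by element-wise multiplication."""
    def __init__(self, channels: int, expansion: int = 4):   # gamma = 4, as above
        super().__init__()
        hidden = channels * expansion
        self.expand = nn.Conv3d(channels, hidden * 2, kernel_size=1)   # both paths at once
        self.depthwise = nn.Conv3d(hidden * 2, hidden * 2, kernel_size=3,
                                   padding=1, groups=hidden * 2)
        self.reduce = nn.Conv3d(hidden, channels, kernel_size=1)       # back to input width

    def forward(self, x):                       # x: (B, C, T, H, W), already normalized
        gate, value = self.depthwise(self.expand(x)).chunk(2, dim=1)
        return self.reduce(F.gelu(gate) * value)    # gating via element-wise product
```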
3.1.2. Cross-Spatial Attention Module
To address the challenge of integrating in-car driver maneuver features and traffic motion scene features at the same scale, we propose a novel module called the Cross-Spatial Attention Module (CSAM), as illustrated in Figure 4.
Within each of the four stages of the STA backbone, the in-car driver maneuver features and the traffic motion scene features are aggregated at the same scale. Specifically, we first aggregate the in-car driver maneuver sequence feature (inside feature1) and the traffic motion scene sequence feature (outside feature1) using a 3D-CNN, which allows different aspects of the input data to be considered simultaneously in the spatiotemporal dimensions. Because the motion features inside and outside the car often carry information of different levels and types, aggregating these features comprehensively captures the spatiotemporal relationships in the input data. By leveraging their complementarity, the model gains a more comprehensive understanding of the data's characteristics, enhancing its expressive power. Furthermore, feature fusion enables the model to better tolerate input variations and noise, improving robustness across different in-car driver environments and external traffic conditions.
To further explore and aggregate the features after the 3D-CNN fusion, we introduce a self-attention mechanism in CSAM. This captures long-distance dependencies between different positions in the spatiotemporal data more effectively, enhancing the model's understanding of the global structure. By counteracting positional biases and dynamically weighting the importance of different spatiotemporal points, the model's performance and expressive capabilities improve when dealing with sequential or volumetric data.
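A minimal sketch of this fusion pattern, assuming same-shape inside and outside features, is given below: a 3D convolution aggregates the concatenated streams, and standard multi-head self-attention then operates over the flattened spatiotemporal tokens. In practice such attention would be applied at a reduced resolution to keep the token count manageable; names and the head count are illustrative.

```python
import torch
import torch.nn as nn

class CSAM(nn.Module):
    """Fuse same-scale inside/outside features with a 3D conv, then apply
    self-attention over the flattened spatiotemporal tokens."""
    def __init__(self, channels: int, num_heads: int = 4):
        super().__init__()
        self.fuse = nn.Conv3d(channels * 2, channels, kernel_size=3, padding=1)
        self.norm = nn.LayerNorm(channels)
        self.attn = nn.MultiheadAttention(channels, num_heads, batch_first=True)

    def forward(self, inside, outside):         # both: (B, C, T, H, W)
        x = self.fuse(torch.cat([inside, outside], dim=1))   # 3D-CNN aggregation
        b, c, t, h, w = x.shape
        tokens = self.norm(x.flatten(2).transpose(1, 2))     # (B, T*H*W, C)
        attended, _ = self.attn(tokens, tokens, tokens)      # long-range dependencies
        out = tokens + attended                              # residual connection
        return out.transpose(1, 2).reshape(b, c, t, h, w)
```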
3.1.3. Multi-CSAM Fusion Module
To comprehensively capture information, enhance expressiveness, improve robustness, and mitigate overfitting risks, we perform a secondary fusion of the features extracted at different stages and scales, as depicted by the Multi-CSAM Fusion Module (MCFM) in Figure 1.
Specifically, we first downsample CSAM1, CSAM2, and CSAM3 (the features fused by the CSAM module at the earlier stages) to match the scale of CSAM4. We then apply the same spatial dimension reduction, spatiotemporal average pooling, to the features at each scale. The dimension-reduced features from all scales are concatenated into a single feature vector, and a dropout layer after the final average pooling layer guards against overfitting. Finally, the concatenated feature vector is fed into fully connected layers to output the ultimate prediction for driver intent recognition.
This process of making different-scale features consistent facilitates parameter sharing, improves computational efficiency, ensures dimensional consistency, and avoids information loss. It simplifies the model's learning task, reduces the risk of overfitting, and enhances the model's generalization ability and performance across various tasks.
By fusing features from different scales and stages for driver intent recognition, the model comprehensively captures different levels and details of input data, improving its global understanding of data features. Fusing features from different scales enhances the model’s expressive power, enabling it to better learn and represent complex data patterns. Furthermore, the fusion of features from multiple scales helps improve the model’s robustness to scale and structural variations, making it more resilient in different environments. Multi-scale feature fusion also aids in reducing the model’s parameter count, lowering the risk of overfitting, and enhancing its generalization performance. In summary, the multi-scale feature fusion followed by classification through fully connected layers is well-suited for driver intent recognition.
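The procedure in Section 3.1.3 can be sketched as follows, assuming the four CSAM outputs carry the stage channel widths from Section 3; the strided 1 × 1 × 1 convolutions used here to match scales are illustrative stand-ins for the downsampling operation.

```python
import torch
import torch.nn as nn

class MCFM(nn.Module):
    """Downsample CSAM1-3 to CSAM4's scale, pool each scale, concatenate,
    and classify through dropout + fully connected layers."""
    def __init__(self, channels=(64, 128, 256, 512), num_classes=5, p_drop=0.5):
        super().__init__()
        # strided 1x1x1 convs as illustrative stand-ins for the downsampling step
        self.down = nn.ModuleList([
            nn.Conv3d(c, channels[-1], kernel_size=1, stride=2 ** (3 - i))
            for i, c in enumerate(channels[:-1])
        ])
        self.pool = nn.AdaptiveAvgPool3d(1)   # spatiotemporal average pooling
        self.dropout = nn.Dropout(p_drop)     # guards against overfitting
        self.fc = nn.Linear(channels[-1] * 4, num_classes)

    def forward(self, csam_feats):            # [CSAM1, CSAM2, CSAM3, CSAM4]
        pooled = []
        for i, f in enumerate(csam_feats):
            if i < len(self.down):
                f = self.down[i](f)           # match CSAM4's width and scale
            pooled.append(self.pool(f).flatten(1))
        return self.fc(self.dropout(torch.cat(pooled, dim=1)))
```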
3.2. Asymmetric Loss
According to statistics on driving paths, the straight driving maneuver is more common than turning and lane changing, resulting in class imbalance across the samples. This imbalance challenges machine learning models during training: because the dominant class is more abundant in the training set, the model may learn its features and patterns more strongly, leading to insufficient learning for minority classes and increased training difficulty.
We introduce a weighted cross-entropy loss function to address the problem of imbalanced training data in driver intent inference, ensuring that the model adapts to the sample distribution of each class and that every class contributes to the loss in a more balanced way. The loss for each class is multiplied by a weight, which we set to the reciprocal of that class's sample count in the training set, and the average loss is then computed. This prevents overfitting to classes with many samples and ensures that each class appropriately influences the overall loss for more effective model training. Specifically, for the 5-class classification of driver intent inference, the weighted cross-entropy loss function can be defined as follows:

$$L = -\frac{1}{N}\sum_{i=1}^{N}\sum_{j=1}^{5} w_j\, y_{ij} \log\left(\hat{y}_{ij}\right) \quad (3)$$

where $N$ is the total number of training samples, $w_j$ is the weight for class $j$, set as the reciprocal of the number of samples for that class in the training set, $y_{ij}$ indicates whether the $i$-th sample's true label is class $j$, and $\hat{y}_{ij}$ is the model's predicted output, representing the probability that sample $i$ belongs to class $j$.
Setting the weights in this manner helps address the issue of imbalanced samples and improves the prediction accuracy for each class. The specific weights should be tuned experimentally, based on the dataset's characteristics and the task, to find the optimal configuration.
Our experiments showed that identifying certain maneuvers, such as left and right lane changes or turns, can be particularly challenging. This difficulty arises from the dataset’s distribution and the inherent characteristics of these maneuvers. Therefore, we adjusted the weights in our model to allocate more significance to these harder-to-recognize categories. Our approach involves increasing the weights for categories based on their difficulty level, with the current coefficients refined through continuous hyperparameter tuning.
Here, we provide an example weight set for the cross-entropy loss function. Based on the distribution of driving actions in our training dataset, as well as finer adjustments made according to the difficulty level of model recognition for different categories, we arrived at these parameter weights as follows (assuming a hypothetical distribution for illustration purposes):
- Go Straight: weight = 0.3
- Left Lane Change: weight = 1.5
- Left Turn: weight = 1.2
- Right Lane Change: weight = 1.0
- Right Turn: weight = 1.0
We found that applying these weights enhanced the model’s prediction accuracy for minority classes and those more challenging to recognize, without significantly impacting overall performance.
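In PyTorch, applying such class weights is a one-liner. The sketch below uses the illustrative weights above; the class ordering is an assumption, and the final two lines show the reciprocal-of-count scheme from Section 3.2 using the per-class sample counts reported in Section 4.2.

```python
import torch
import torch.nn as nn

# Class order assumed: go straight, left lane change, left turn,
# right lane change, right turn. Weights mirror the illustrative values above.
class_weights = torch.tensor([0.3, 1.5, 1.2, 1.0, 1.0])
criterion = nn.CrossEntropyLoss(weight=class_weights)

logits = torch.randn(8, 5)              # model outputs for a batch of 8 clips
labels = torch.randint(0, 5, (8,))      # ground-truth maneuver indices
loss = criterion(logits, labels)        # weighted per class, averaged over the batch

# Alternative: reciprocal-of-count weights (sample counts from Section 4.2)
counts = torch.tensor([234.0, 124.0, 58.0, 123.0, 55.0])
reciprocal_weights = 1.0 / counts
```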
4. Experiments and Analysis
4.1. Datasets and Evaluation Metrics
This section delineates the datasets utilized in the experiments and the evaluation criteria employed.
We evaluate the performance of our proposed maneuver prediction method using the publicly available Brain4Cars dataset [33], which comprises 594 video segments. The Brain4Cars dataset [3] includes driver observation videos (1088 px × 1920 px, 25 fps) and videos of the outside scenes (480 px × 720 px, 30 fps), recorded simultaneously. The dataset consists of five driving maneuver categories: go straight, left lane change, left turn, right lane change, and right turn. Moreover, samples without simultaneous recordings of the inside and outside views are considered invalid and were not used further in our study.
We use 5-fold cross-validation for all the experiments in this work, aligning with previous works using the Brain4Cars dataset [3,4,6,7,8]. The final evaluation metrics include average accuracy and F1 score, along with their standard deviations.
In this research, we utilize accuracy, the F1 score, and the confusion matrix to evaluate the driver intent recognition performance of both the proposed model and other models. Accuracy (Acc) and the F1 score are computed using Equations (4) and (7) [34,35], outlined as follows:

$$\mathrm{Acc} = \frac{TP + TN}{TP + TN + FP + FN} \quad (4)$$

$$\mathrm{Pr} = \frac{TP}{TP + FP} \quad (5)$$

$$\mathrm{Re} = \frac{TP}{TP + FN} \quad (6)$$

$$F1 = \frac{2 \cdot \mathrm{Pr} \cdot \mathrm{Re}}{\mathrm{Pr} + \mathrm{Re}} \quad (7)$$

In the equations, TP stands for true positive, indicating cases where both the true label and the predicted label are positive; TN stands for true negative, signifying instances where both the true label and the predicted label are negative; FP corresponds to false positive, representing situations where the true label is negative but the predicted label is positive; and FN denotes false negative, indicating scenarios where the true label is positive but the predicted label is negative. Pr and Re refer to precision and recall, computed using Equations (5) and (6), respectively, while the F1 score is their harmonic mean.
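For reference, these metrics can be computed directly with scikit-learn; macro averaging over the five classes is an assumption here, as the averaging mode is not stated above.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix, f1_score

y_true = np.array([0, 1, 2, 3, 4, 0, 1, 0])   # hypothetical ground-truth labels
y_pred = np.array([0, 1, 2, 3, 0, 0, 1, 2])   # hypothetical predictions

acc = accuracy_score(y_true, y_pred)            # Equation (4)
f1 = f1_score(y_true, y_pred, average="macro")  # per-class F1 (Equations (5)-(7)), averaged
cm = confusion_matrix(y_true, y_pred)           # basis for Figure 5-style analysis
```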
4.2. Experiment Environment
The training process of STA-Net adopts a transfer learning strategy: the backbone feature extraction network is initialized with weights from Kinetics-400 [36] (a human action dataset), and the entire framework is then trained. The proposed method is implemented using the PyTorch deep learning framework (version 2.1.0) and employs AdamW [37] as the optimizer with a weight decay of 0.05. A cosine learning rate scheduler [38] is utilized, setting the base learning rate to . The resolutions of the in-cabin and external vehicle camera streams are both set to 224 × 224. The model is trained for 200 epochs on an NVIDIA RTX 3090Ti GPU with 24 GB of memory. The Brain4Cars dataset [3] records video sequences of driving maneuvers; a total of 625 samples are collected, comprising 234 forward samples, 124 left lane change samples, 58 left turn samples, 123 right lane change samples, and 55 right turn samples. Of these, 80% of the video sequences are used for training and the remaining 20% for testing.
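A condensed sketch of this optimization setup is shown below. The stand-in model and the base learning rate value are placeholders (the exact base rate is not given above); the optimizer, weight decay, scheduler type, and epoch count follow the text.

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 5)   # stand-in for STA-Net; the real backbone is in Section 3
base_lr = 1e-4             # placeholder value; the exact base rate is not specified here
optimizer = torch.optim.AdamW(model.parameters(), lr=base_lr, weight_decay=0.05)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=200)

for epoch in range(200):                       # 200 training epochs, as in the text
    optimizer.zero_grad()
    loss = model(torch.randn(4, 10)).sum()     # placeholder forward/backward pass
    loss.backward()
    optimizer.step()
    scheduler.step()                           # cosine learning rate schedule
```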
4.3. Pre-Processing and Data Augmentation
To pre-process the data, we first extracted frames from the videos and resized all inputs to a uniform resolution of 224 × 224 pixels. We then applied several data augmentation techniques to enlarge the dataset and increase the robustness of our model: translating the images by 6 pixels in each direction; applying a left-to-right flip (flipLR), which necessitates a corresponding label change (e.g., a 'turning left' label becomes 'turning right', and driver behavior on the left side becomes behavior on the right side), covering a broader range of usage scenarios; and implementing cutout [39], a method that randomly masks out square regions of the image. Additionally, we employed AugMix [40], which combines various augmentations such as auto-contrast, equalization, posterization, and solarization to create a diverse set of training examples. This comprehensive augmentation approach not only amplified our dataset but also ensured broader scenario coverage. Further details of these techniques are discussed in the referenced literature.
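Because the flipLR step is the one augmentation that touches the labels, a brief sketch may help; the class ordering and the wrap-around shift used to approximate translation are assumptions for illustration.

```python
import random
import torch

# Label indices assumed: 0 = go straight, 1 = left lane change, 2 = left turn,
# 3 = right lane change, 4 = right turn (the ordering is an assumption).
FLIP_LABEL = {0: 0, 1: 3, 2: 4, 3: 1, 4: 2}

def augment_clip(clip: torch.Tensor, label: int, max_shift: int = 6):
    """clip: (C, T, H, W). Random translation plus flipLR with label swap."""
    # translate by up to 6 pixels in each direction
    # (wrap-around shift used here as a simple stand-in for translation)
    dy = random.randint(-max_shift, max_shift)
    dx = random.randint(-max_shift, max_shift)
    clip = torch.roll(clip, shifts=(dy, dx), dims=(2, 3))
    # flip left-to-right and swap left/right maneuver labels accordingly
    if random.random() < 0.5:
        clip = torch.flip(clip, dims=[3])
        label = FLIP_LABEL[label]
    return clip, label
```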
4.4. Ablation Experiments
To thoroughly evaluate the practical impact of each module on the performance of the baseline method in real-world scenarios, this study conducted ablation experiments.
Two ablation experiments were performed: (1) a comparison between the improved cross-entropy loss function (ICELF) and the regular cross-entropy loss function, assessing driver intent recognition accuracy under the two settings; and (2) a comparison of the driver intent recognition accuracy obtained by adding the CSAM and MCFM against directly using CSAM4, as shown in Table 1.
We systematically tested the impacts of CSAM and MCFM and of the improved cross-entropy loss function on driver intent recognition accuracy by modifying one component at a time while keeping the other unchanged. The experiments utilized both in-cabin and external camera views. According to the results in Table 1, CSAM and MCFM increased the ACC score by 7.32% compared with the baseline model, while the improved cross-entropy loss function raised the ACC score by a further 0.81%. These results indicate that the two components complement each other in terms of performance and variance, playing crucial roles in enhancing driver intent recognition accuracy. When used together, the accuracy is 90.97%, with an F1 score of 89.37%.
4.5. Comparative Experiments of Existing Methods
In Table 2, we summarize and compare our work with other relevant studies using the most commonly used metrics: precision, recall, and time-to-maneuver (TTM). Time-to-maneuver is defined as the time interval between the moment of the model's most confident prediction and the actual start of the maneuver. Our model's precision and recall are on par with the average levels.
Table 3 presents the comparative results on the Brain4Cars test set. We compared STA-Net with widely used end-to-end approaches [4,13], because they have demonstrated better performance on driver intent recognition tasks than traditional machine learning methods. In [4,13], the researchers applied video action recognition methods, such as 3D-ResNet and ConvLSTM + 3D-ResNet, to driver intent prediction. All models were pretrained on the Kinetics-400 dataset [36], and all video data with a duration of 5 s were used as input. The table shows results for three different inputs: only in-cabin driver maneuver videos, only external traffic scene videos, and both in-cabin and external videos. Average accuracy and F1 scores based on five-fold cross-validation illustrate the performance of the different methods, where "SD" denotes the standard deviation.
Based on Table 3, when exclusively using in-cabin driver maneuver data, all algorithms achieved satisfactory results, with accuracy ranging from 77% to 84%, making further improvement challenging. STA-Net achieved the highest accuracy of 90.97% and an F1 score of 89.37% when provided with dual perspectives, both inside and outside the vehicle. The experiment indicates that intent recognition accuracy increased by approximately 3.66% when the STA-Net model used both the in-cabin driver maneuver and the external traffic scene as inputs. This result strongly suggests that external traffic scene features and in-cabin driver maneuver features contain complementary information for driver intent recognition. Moreover, our model significantly reduces the number of parameters compared with previous methods. Given the computational costs associated with model complexity, models with low resource requirements are preferred for automotive applications; our model extracts valuable features with fewer parameters, facilitating deployment in resource-constrained environments such as onboard vehicle systems.
The confusion matrix for the proposed method is illustrated in Figure 5, showcasing the classification performance for the five intents. The results indicate that lane-keeping and right lane change intents were recognized most accurately, with accuracies of 91.3% and 90.9%, respectively. Recognition of the intent to change lanes to the left is relatively weaker, with an accuracy of approximately 70.8%; it is often confused with the intent to go straight. The accuracy for recognizing the intent to turn left is around 83.3%, and it is also prone to confusion with going straight. The accuracy for recognizing the intent to turn right is approximately 87.5%, with potential confusion with going straight and changing lanes to the right.
Through an analysis of the misclassified samples, we propose three main reasons. First, during some lane-keeping maneuvers, drivers may be inclined to perform left-check actions to ensure safe driving, making it easy to infer a left lane change or left turn. Second, some right lane change intents are similar to lane-keeping maneuvers, suggesting that drivers occasionally exhibit right lane change-like behavior while maintaining their lane, which, although infrequent, can confuse the model. Third, some right turn intents are very similar to maneuvers during right lane changes, which can likewise confuse the model. These observations suggest that in more complex traffic scenarios, drivers' maneuvers may adjust frequently based on the traffic conditions on both sides, posing a challenge for driver intent recognition.
Upon comparing our STA-Net's performance with the method described in [13], we acknowledge the variation in classification accuracy across different maneuvers. Specifically, the model in [13] exhibits superior performance in identifying left lane changes, left turns, and right turns, whereas STA-Net struggles with right lane change predictions. This discrepancy raises a pertinent discussion on the feasibility and potential advantages of employing a multi-method approach for driver maneuver prediction.
Rationale for Multi-Method Approach:
The diversity in driving behaviors and the complexity of road scenarios necessitate a nuanced approach to maneuver recognition. A single model may not optimally capture the intricacies associated with various maneuvers due to differences in visual cues, driver intentions, and environmental contexts. Therefore, leveraging the strengths of different predictive models based on the predicted maneuver could enhance overall performance and reliability.
Methodological Considerations:
To explore this possibility, we propose a framework in which the predictive model dynamically selects between STA-Net and alternative methods, like the one presented in [13], based on the specific maneuver scenario. This selection could be informed by pre-defined criteria, such as the maneuver type, the confidence levels of the predictions, or contextual factors like traffic density and road type.
Potential Benefits:
- (1) Enhanced Accuracy: By aligning model selection with the maneuver's characteristics, we anticipate improvements in prediction accuracy, particularly for maneuvers where STA-Net's performance is currently lacking.
- (2) Reduced False Positives/Negatives: A more tailored approach allows for finer discrimination between maneuvers, potentially reducing misclassifications and enhancing the system's reliability.
- (3) Adaptability: This strategy introduces a layer of adaptability, enabling the system to evolve and incorporate new methods or findings from ongoing research. A sketch of one possible selection scheme follows this list.
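As one possible realization of such a selection scheme, the confidence-based dispatcher below defers to a secondary model when the primary model's prediction is uncertain. The two-argument model interface and the threshold value are hypothetical, intended only to illustrate the idea.

```python
import torch

def predict_with_fallback(primary, secondary, clip_in, clip_out, threshold=0.8):
    """Use the primary model unless its confidence is below `threshold`,
    then fall back to whichever model is more confident."""
    with torch.no_grad():
        probs = primary(clip_in, clip_out).softmax(dim=-1)   # assumed model interface
        conf, label = probs.max(dim=-1)
        if conf.item() >= threshold:
            return label.item(), conf.item()
        probs2 = secondary(clip_in, clip_out).softmax(dim=-1)
        conf2, label2 = probs2.max(dim=-1)
        if conf2.item() > conf.item():
            return label2.item(), conf2.item()
        return label.item(), conf.item()
```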