Article

Efficient Roadside Vehicle Line-Pressing Identification in Intelligent Transportation Systems with Mask-Guided Attention

1 School of Computer Science and Technology, Tongji University, Shanghai 201804, China
2 Zhaobian (Shanghai) Technology Co., Ltd., Shanghai 201804, China
3 Key Laboratory of Road and Traffic Engineering of the Ministry of Education, Tongji University, Shanghai 201804, China
4 Department of Geography, Hong Kong Baptist University, Hong Kong 999077, China
* Author to whom correspondence should be addressed.
Sustainability 2025, 17(9), 3845; https://doi.org/10.3390/su17093845
Submission received: 17 March 2025 / Revised: 16 April 2025 / Accepted: 21 April 2025 / Published: 24 April 2025

Abstract

Vehicle line-pressing identification from a roadside perspective is a challenging task in intelligent transportation systems. Factors such as vehicle pose and environmental lighting significantly affect identification performance, and the high cost of data collection further exacerbates the problem. Existing methods struggle to achieve robust results across different scenarios. To improve the robustness of roadside vehicle line-pressing identification, we propose an efficient method. First, we construct the first large-scale vehicle line-pressing dataset based on roadside cameras (VLPI-RC). Second, we design an end-to-end convolutional neural network that integrates vehicle and lane line mask features, incorporating a mask-guided attention module to focus on key regions relevant to line-pressing events. Finally, we introduce a binary balanced contrastive loss (BBCL) to improve the model’s ability to generate more discriminative features, addressing the class imbalance issue in binary classification tasks. Experimental results demonstrate that our method achieves 98.65% accuracy and 96.34% F1 on the VLPI-RC dataset. Moreover, when integrated into the YOLOv5 object detection framework, it attains an identification speed of 108.29 FPS. These results highlight the effectiveness of our approach in accurately and efficiently detecting vehicle line-pressing behaviors.

1. Introduction

Intelligent transportation systems play an important role in daily life. As traffic control systems, they integrate artificial intelligence, Internet of Things technology, cloud computing, intelligent hardware, and the corresponding software systems. They have been applied in many different fields, such as traffic flow detection and travel time prediction [1,2], vehicle identification [3,4], and accident detection and prevention [5,6]. By processing the collected data with the analysis models and algorithms in the software system, they enable intelligent scheduling and management of urban road traffic, greatly improving the condition of urban roads and the safety of people's travel.
Vehicle line-pressing identification is a research focus in the field of intelligent transportation. The 5GAA report [7] notes that when an autonomous vehicle detects an obstacle in its current lane, it must maneuver to avoid a collision, for example by making a sudden lane change, which can in turn cause collisions with adjacent or closely following vehicles in neighboring lanes. Detecting and warning about vehicle line-pressing events using vehicle-to-everything (V2X) technology can mitigate this collision risk and thereby enhance safety. At the same time, in countries with a high density of vehicles, changing lanes by crossing solid lines is considered a violation of road traffic safety regulations. Traffic management authorities need to detect such behavior promptly, preserve the relevant image evidence, and impose penalties on violating vehicles. However, the urban road network has a complex structure and a large coverage area, which makes traditional manual inspection impractical. The motivation of this paper is therefore to design a simple and efficient method for recognizing vehicle line-pressing, aiming both to reduce the risk of collisions caused by sudden lane changes and to enhance the ability of traffic management authorities to identify violating vehicles.
Currently, several methods [8,9,10,11,12,13,14] for vehicle line-pressing identification have been proposed. However, factors such as vehicle occlusion of lane lines, weather conditions, and environmental brightness, as shown in Figure 1, still significantly impact existing methods, leading to a substantial number of misjudgments. Through our research on previous methods, we have identified two key challenges affecting this task. (1) Insufficient Data: Vehicle line-pressing events are relatively uncommon in real-world driving scenarios. Collecting such data across diverse environments, such as intersections and highways, demands considerable cost and effort. Moreover, single-view data often suffer from occlusion, making accurate annotation difficult. In existing works, for instance, only 200 image samples were used in [12] to validate the proposed method, which is insufficient to draw reliable conclusions or to demonstrate model robustness. (2) Limited Generalization: Vehicles and lane lines serve as key information for identifying line-pressing behavior, and accurately capturing the correlation between them is essential for improving model performance. Previous methods have attempted to address this by applying direct threshold-based judgments [9,10] or extracting more complex vehicle pose information [12,13,14]. However, these approaches often suffer from poor transferability when applied to varying perspectives, road types, or geographical regions, limiting their generalization in real-world deployments.
In this paper, we propose an end-to-end vehicle line-pressing identification framework based on roadside camera data. To support this task, we construct the first large-scale dataset, VLPI-RC, consisting of 18,324 images and 34,516 labeled instances. Our method fuses vehicle and lane line features, enabling a deep convolutional neural network to learn feature representations of line-crossing vehicles, thereby achieving end-to-end processing. To enhance the model’s focus on relevant regions, we introduce a mask-guided attention module that incorporates lane line masks as prior information into the attention mechanism. Furthermore, to address class imbalance and improve discrimination of hard samples, we propose the BBCL. Experimental results demonstrate that our method achieves robust and accurate line-pressing identification across diverse traffic scenarios. In summary, our major contributions lie in four aspects:
  • We propose VLPI-RC, a large-scale dataset containing diverse scenarios, enabling a more comprehensive evaluation of model performance.
  • We propose a method that integrates vehicle features with lane line features, enabling end-to-end processing for vehicle line-pressing identification. This enhances the model’s efficiency, allowing it to adapt to more complex environments through automated feature learning.
  • We introduce a mask-guided attention mechanism, which utilizes lane line masks as prior information. This allows the model to more effectively capture the relationship between vehicle features and lane line features, focusing more on the key areas of vehicle line-pressing.
  • We propose BBCL to address the data imbalance issue and introduce a hard example mining strategy in contrastive learning, helping the model generate more discriminative features.

2. Related Work

The issue of vehicles occluding lane markings poses a significant challenge for detecting vehicle line-crossing using monocular cameras. To address this problem, various methods have been proposed to effectively process information on both vehicles and lane markings. Figure 2 presents a visual representation of some of these approaches. Lee et al. [8] proposed using inverse perspective transformation (IPM) to process the image of the road surface so that the lane lines can be parallel to each other after transformation. By setting a fixed detection area and combining with a tracking algorithm, it can be determined whether the vehicle has changed lanes. This method cannot achieve timely recognition because the lane line information is not used for calculation.
H.D et al. [9,10] proposed a method based on a 2D bounding box (bbox), as shown in Figure 2b. First, the lane line information is converted into straight lines in the image coordinate system using the Canny edge detection algorithm and the standard Hough transform. The distance from the center point of the bbox to the nearest line is then calculated; if it is less than a manually set threshold, the vehicle is judged to be in the line-pressing state. However, we found that relying solely on a single threshold to determine the vehicle's line-pressing status lacks robustness and can lead to misclassification. This approach struggles to accurately capture the relationship between the vehicle and lane lines when handling various vehicle sizes and complex occlusions. SLDNet [11] performs instance segmentation of vehicles and lane lines with a convolutional neural network, as shown in Figure 2d, and feeds the segmentation results into a binary classification model. The method can adapt to different scenarios, but it still cannot solve the problem of vehicles occluding lane lines: instance segmentation struggles to extract the features of occluded parts, so only a small number of lane line features reach the classification model. At the same time, using Mask R-CNN [15] and LaneNet [16] is time-consuming and unsuitable for real-time detection tasks.
In order to achieve more accurate recognition, Gao et al. [12], Wu et al. [13], and Zheng et al. [14] proposed methods that predict the quadrilateral outline of the vehicle chassis and incorporate lane line information. Specifically, Gao et al. [12] utilized the vehicle's driving direction to obtain 2D chassis posture information based on the 2D bounding box from object detection. Building on this, Wu et al. [13] further refined the vehicle's driving direction through object tracking, thus avoiding inaccuracies in chassis information caused by manually setting the driving direction. Additionally, Zheng et al. [14] introduced 3D object detection in the image, directly obtaining the vehicle's outline information using a 3D detection method, as shown in Figure 2c. However, since these methods are primarily designed for limited scenarios and target types, they are susceptible to various influencing factors. For example, the complex and variable viewpoints of roadside cameras, as well as the inability to effectively handle large vehicles (such as buses and large semi-trailers), can lead to inaccurate predictions of the vehicle chassis, thereby degrading recognition of the vehicle's line-pressing status.
In autonomous driving, vehicle line-pressing identification is also a current research hotspot. Li et al. [17] proposed assessing driving risks in the context of lane-changing decisions for autonomous vehicles using temporal trajectory data and probabilistic models; however, this approach does not perceive the lane-changing behavior of surrounding vehicles. Biparva et al. [18] classified the lane changes of surrounding vehicles in autonomous driving using spatial and temporal information from video data. Zhang et al. [19] proposed detecting wheel lines and using them to determine whether they intersect the lane line. Compared with on-board cameras, roadside cameras are mounted higher and cannot reliably capture the wheels of vehicles; their larger field of view also means the images contain more small targets.

3. VLPI-RC Dataset

The VLPI-RC dataset consists of four sub-datasets, and the detailed information of each is provided as follows.
  • BrnoCompSpeed [20]: This dataset uses traffic cameras to collect 21 videos, each of which is about 60 min long and has a resolution of 1920 × 1080. It contains a total of seven different scenes, and the same scene is divided into three perspectives: left, middle, and right, as shown in Figure 3a. We sampled 11,139 video frames for data annotation.
  • BIT-Vehicle [21]: This dataset contains 9850 vehicle images, with resolutions of 1600 × 1200 and 1920 × 1080. A total of 2196 images were annotated through selection. As shown in Figure 3b, the camera is positioned parallel to the lane, resulting in large vehicle targets in the images.
  • GRAM-RTM [22]: This dataset was collected from three different scenes. Considering the impact of image clarity, only the M-30-HD subset is used in our work, as shown in Figure 3c. By performing uniform sampling on consecutive frames, a total of 939 images were annotated.
  • Private Datasets: We collected 4032 images at a resolution of 1920 × 1080 using multiple roadside cameras deployed in Beijing and Shanghai, China, covering diverse traffic scenarios such as highways and urban intersections. As illustrated in Figure 3d, the dataset also includes recordings under varying environmental conditions, including rainy weather and nighttime, to comprehensively evaluate the robustness of the proposed method in real-world settings.
During the annotation process of the VLPI-RC dataset, we first identified the primary regions in each image where vehicle line-pressing identification was required. Within these regions, vehicle bounding boxes, line-pressing status, and corresponding lane line masks were annotated. To ensure high annotation quality, we adopted effective strategies for resolving ambiguous or complex scenarios. When vehicles or lane markings were partially occluded, bounding boxes and lane masks were estimated using contextual cues from adjacent frames. If the line-pressing status could not be directly determined, synchronized images from different camera views within the same scene were used to ensure accurate identification. All annotators involved in this process were well trained and had extensive experience in image annotation. Furthermore, all annotations underwent a second round of quality verification to ensure consistency and accuracy across the dataset. The final annotation statistics are summarized in Table 1.

4. Method

4.1. Overview

The overall pipeline of our method is shown in Figure 4. The proposed method is mainly divided into three stages: (1) inputting the vehicle image and lane line mask for feature fusion; (2) utilizing the lane line mask information to apply attention to key areas in the fused feature map; and (3) optimizing the network parameters using softmax loss and BBCL. Here, we briefly describe the entire processing procedure. First, we apply data augmentation to the input vehicle image $I_{Src}$ and the lane mask $I_{Mask}$. The features of each image, $F_{Src}$ and $F_{Mask}$, are extracted separately through a dual-branch feature extraction module. Subsequently, the features $F_{Src}$ and $F_{Mask}$ are fused to generate the feature $F$, and the multi-scale feature map $F_l$ is extracted through a neural network (such as ResNet50 [23]). Using the lane mask feature map $M_l$ and the multi-scale feature map $F_l$, mask-guided attention is computed as $F_l = \mathrm{Attention}(M_l, F_l)$. Finally, based on the softmax loss, we introduce BBCL to ensure higher similarity for samples of the same class and lower similarity for samples of different classes. After multiple epochs of training, the network model ultimately produces the classification results for the vehicle line-pressing status.
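For clarity, the sketch below shows how one training step could compose these three stages in PyTorch. The callables `model` (assumed to return classification logits and 256-dimensional embeddings) and `bbcl` (the loss of Section 4.5), as well as the tensor names, are illustrative assumptions rather than the released implementation.

```python
import torch.nn.functional as F

def train_step(model, bbcl, optimizer, vehicle_img, lane_mask, labels, lam=0.5):
    """One optimization step: fusion + mask-guided attention in the forward pass,
    then the combined softmax/BBCL objective (see Section 4.5).
    vehicle_img: (B, 3, H, W); lane_mask: (B, 1, H, W); labels: (B,), 0 = normal, 1 = line-pressing."""
    optimizer.zero_grad()
    logits, embeddings = model(vehicle_img, lane_mask)                   # stages (1)-(2)
    loss_cls = F.cross_entropy(logits, labels)                           # softmax loss
    loss_con = bbcl(embeddings[labels == 0], embeddings[labels == 1])    # BBCL over the two classes
    loss = (1.0 - lam) * loss_cls + lam * loss_con                       # weighted total, lambda = 0.5
    loss.backward()
    optimizer.step()
    return loss.item()
```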

4.2. Robust Input Augmentation

To enhance model robustness, we applied various data augmentation methods during training to address challenges such as weather conditions, camera image quality, and bounding box accuracy, thereby improving generalization and reducing overfitting, as shown in Figure 5. Specifically, random perturbation of bounding boxes independently scales the height and width of each box by random factors drawn from [-20%, +20%], improving robustness against inaccurate detections. Horizontal flipping is applied with a 50% probability to increase orientation diversity. Brightness adjustment scales image brightness by a factor randomly sampled from [0.5, 1.5], simulating illumination changes across different times of day. Gaussian noise with a standard deviation randomly selected from [10, 50] is added to simulate sensor or compression noise. Motion blur is applied using a linear kernel with a random size of [5, 20] pixels to imitate the appearance of fast-moving objects.
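The following NumPy/OpenCV sketch shows one possible realization of these augmentations; the parameter names and the box representation (center plus width/height) are illustrative assumptions, not the authors' code.

```python
import random
import numpy as np
import cv2

def jitter_bbox(cx, cy, w, h, max_scale=0.2):
    """Randomly rescale box width and height by independent factors in [-20%, +20%]."""
    w *= 1.0 + random.uniform(-max_scale, max_scale)
    h *= 1.0 + random.uniform(-max_scale, max_scale)
    return cx, cy, w, h

def augment(img):
    """img: HxWx3 uint8 crop of the vehicle region (BGR)."""
    if random.random() < 0.5:                                 # horizontal flip, 50% probability
        img = cv2.flip(img, 1)
    img = img.astype(np.float32) * random.uniform(0.5, 1.5)   # brightness scaling
    noise = np.random.normal(0.0, random.uniform(10, 50), img.shape)  # Gaussian noise
    img = np.clip(img + noise, 0, 255).astype(np.uint8)
    ksize = random.randint(5, 20)                             # linear motion-blur kernel
    kernel = np.zeros((ksize, ksize), np.float32)
    kernel[ksize // 2, :] = 1.0 / ksize
    return cv2.filter2D(img, -1, kernel)
```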

4.3. Feature Fusion Module

To explore the correlation between the lane markings and the vehicle and to further enhance the significance of lane information in vehicle line-pressing classification, we utilize a dual-branch feature extraction pipeline, where the vehicle image $I_{Src} \in \mathbb{R}^{W \times H \times 3}$ and the lane mask $I_{Mask} \in \mathbb{R}^{W \times H \times 1}$ are input separately. Because different backbone networks preprocess the input differently, we use ResNet50 as an example. The feature extraction module mainly consists of a convolution layer, normalization layer, activation layer, and max pooling layer. Since the input data dimensions differ, the number of input channels in the convolution layers needs to be adjusted in the two branches, while ensuring consistency in the remaining structure. Through feature extraction, we obtain two feature maps, $F_{Src}, F_{Mask} \in \mathbb{R}^{C \times W \times H}$, which are combined using element-wise fusion to generate the fusion result:
$$F(c, i, j) = F_{Src}(c, i, j) + F_{Mask}(c, i, j) \qquad (1)$$
where $c \in \{1, \ldots, C\}$ represents the channel index, and $(i, j)$ is the spatial position index, with $i \in \{1, \ldots, W\}$ and $j \in \{1, \ldots, H\}$. The fusion result preserves the original information of the vehicle and the lane line, and the deep structure of the convolutional neural network can fully capture the positional correlation between the two.
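A minimal PyTorch sketch of this dual-branch stem and the element-wise fusion of Equation (1) is given below, assuming a ResNet50-style stem; the channel sizes are assumptions.

```python
import torch.nn as nn

class DualBranchStem(nn.Module):
    """Separate stems for the vehicle image (3 channels) and lane mask (1 channel),
    followed by element-wise addition as in Equation (1)."""
    def __init__(self, out_channels=64):
        super().__init__()
        def stem(in_ch):  # conv -> BN -> ReLU -> max pool, mirroring the ResNet50 stem
            return nn.Sequential(
                nn.Conv2d(in_ch, out_channels, kernel_size=7, stride=2, padding=3, bias=False),
                nn.BatchNorm2d(out_channels),
                nn.ReLU(inplace=True),
                nn.MaxPool2d(kernel_size=3, stride=2, padding=1),
            )
        self.src_branch = stem(3)    # vehicle image I_Src
        self.mask_branch = stem(1)   # lane mask I_Mask

    def forward(self, img, mask):
        f_src = self.src_branch(img)
        f_mask = self.mask_branch(mask)
        return f_src + f_mask        # element-wise fusion, Equation (1)
```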

4.4. Mask-Guided Attention Module

Attention mechanisms have significantly improved visual recognition tasks by enabling the extraction of global contextual features [24]. However, traditional approaches rely solely on internal feature relationships and often underutilize prior knowledge, limiting their ability to focus on task-relevant regions [25,26,27]. As shown in Figure 6, we address this by introducing a lane mask M as a spatial prior to guide the attention mechanism, enhancing the model’s focus on lane line regions.
In ResNet50, the feature map output by each residual block $R_l$ is denoted $F_l \in \mathbb{R}^{C_l \times W_l \times H_l}$, where $l = 1, 2, 3, 4$ indexes the four residual blocks. To integrate the mask features into the output features of each residual block, we need to adjust the size of the mask feature map to match the output size of each residual block. Through a series of convolution, normalization, and activation operations, a mask feature map $M_l \in \mathbb{R}^{C_l \times W_l \times H_l}$ with the same dimensions as the residual block's feature map is generated. We employ a $1 \times 1$ convolution kernel in these operations to efficiently transform the feature dimensions while maintaining spatial alignment. We input the generated fused feature map $F_l$ and the mask feature map $M_l$ into the mask-guided attention module.
In the mask-guided attention module, we first reshape the feature maps to $F_l, M_l \in \mathbb{R}^{C_l \times (W_l \times H_l)}$ so that the attention mechanism can be computed over the spatial dimension. Then, the mask feature map $M_l$ is used to generate the query matrix $Q$, mapping the mask features into the query space through a linear transformation:
$$Q = W_Q \cdot M_l \qquad (2)$$
where $W_Q \in \mathbb{R}^{D \times C_l}$ is the linear projection matrix that maps the channel dimension $C_l$ of the mask feature map to the query space of dimension $D$. To ensure consistency between the input and output feature map channels, we set $D = C_l$. Similarly, the fused feature map $F_l$ is used to generate the key matrix $K$ and the value matrix $V$:
$$K = W_K \cdot F_l, \qquad V = W_V \cdot F_l \qquad (3)$$
where $W_K, W_V \in \mathbb{R}^{C_l \times C_l}$. Through the above transformations, we obtain $Q, K, V \in \mathbb{R}^{C_l \times (W_l \times H_l)}$. The correlation between different positions is computed using the dot product of the query matrix $Q$ and the key matrix $K$, resulting in the attention weight $A$:
$$A = \mathrm{softmax}\!\left(\frac{Q \cdot K^{T}}{\sqrt{d}}\right) \qquad (4)$$
where $d$ represents the dimension of the query and key vectors. Scaling the dot product by $\sqrt{d}$ prevents the values from becoming too large, which could otherwise lead to excessively small gradients in the softmax function. By applying the attention weight matrix $A$ to the value matrix $V$, we obtain the new enhanced feature map $F_{en,l}$:
$$F_{en,l} = A \cdot V \qquad (5)$$
Finally, the feature map $F_{en,l} \in \mathbb{R}^{C_l \times W_l \times H_l}$ is reshaped back to its three-dimensional form. This processed feature map is then fed into the next layer of the network for further computation, allowing the hierarchical extraction of information to continue across subsequent layers.
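The sketch below gives one plausible PyTorch reading of Equations (2)-(5), with the mask features producing the query and the fused features producing the key and value; the use of 1 × 1 convolutions for the projections and the exact scaling dimension are assumptions.

```python
import torch
import torch.nn as nn

class MaskGuidedAttention(nn.Module):
    """Mask-guided attention for one residual stage: Q from the mask features,
    K and V from the fused features, as in Equations (2)-(5)."""
    def __init__(self, channels):
        super().__init__()
        self.w_q = nn.Conv1d(channels, channels, kernel_size=1, bias=False)  # W_Q, with D = C_l
        self.w_k = nn.Conv1d(channels, channels, kernel_size=1, bias=False)  # W_K
        self.w_v = nn.Conv1d(channels, channels, kernel_size=1, bias=False)  # W_V

    def forward(self, f_l, m_l):
        # f_l: fused features (B, C_l, W_l, H_l); m_l: resized mask features of the same shape.
        b, c, w, h = f_l.shape
        f, m = f_l.flatten(2), m_l.flatten(2)                            # (B, C_l, W_l*H_l)
        q, k, v = self.w_q(m), self.w_k(f), self.w_v(f)
        d = q.size(-1)                                                   # length of the dotted vectors
        attn = torch.softmax(q @ k.transpose(1, 2) / d ** 0.5, dim=-1)   # Equation (4)
        out = attn @ v                                                   # Equation (5)
        return out.view(b, c, w, h)                                      # back to (B, C_l, W_l, H_l)
```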

4.5. Learning Balanced and Discriminative Features

Softmax loss [28] is typically used for multi-class classification tasks. However, it focuses only on individual sample classification and lacks constraints between samples, leading to poor intra-class feature compactness and limited ability to handle class imbalance. It can be defined as follows:
$$L_{softmax} = -\frac{1}{N}\sum_{i=1}^{N}\sum_{j=1}^{K} y_{ij} \cdot \log\!\left(\frac{e^{z_{ij}}}{\sum_{k=1}^{K} e^{z_{ik}}}\right) \qquad (6)$$
where $N$ is the number of samples, $K$ is the number of classes, $y_{ij}$ is the one-hot encoding of the class of sample $i$ (where $y_{ij} = 1$ indicates that sample $i$ belongs to class $j$), and $z_{ij}$ represents the logit for sample $i$ in class $j$.
To overcome these limitations, we propose BBCL. Specifically, we adopt a fixed sampling ratio in each mini-batch to address the class imbalance in the data. As shown in Figure 7, we sample $2n$ instances from the normal class and $2m$ instances from the line-pressing class. The features of the samples are extracted using the backbone network and mapped to a 256-dimensional space through a fully connected layer. The data are divided into two parts based on indices to construct feature pairs $(f^{(0)}_{2n-1}, f^{(0)}_{2n})$ and $(f^{(1)}_{2m-1}, f^{(1)}_{2m})$, and the Euclidean distance between two features is calculated as $d_{i,j} = \| f_i - f_j \|_2$. Because these distances lie in $[0, +\infty)$, the loss can fluctuate considerably during computation, which affects the training stability of the model. Inspired by [29], we introduce a Gaussian distribution function to map the distances. Since our focus in binary classification is primarily on the aggregation of same-class samples, we simplify and redefine the mapping function as follows:
$$G(d) = 1 - e^{-\beta d^{2}} \qquad (7)$$
where $\beta$ is a hyperparameter that controls the smoothness of the mapping and the convergence speed of the model. According to the experimental results in Section 5.4.1, we set $\beta = 0.05$ in our experiments.
To better achieve intra-class cohesion, we introduce a hard example mining strategy. During this process, we simultaneously compute the maximum and average distances for each class of samples to ensure that the model focuses on the most difficult sample pairs during training, while also considering the overall relationships among the samples. We define this as follows:
$$D^{(0)} = \frac{\max\{ g^{(0)}_{1,2}, \ldots, g^{(0)}_{2n-1,2n} \} + \frac{1}{N}\sum_{i=1}^{N} g^{(0)}_{2i-1,2i}}{2}, \qquad D^{(1)} = \frac{\max\{ g^{(1)}_{1,2}, \ldots, g^{(1)}_{2m-1,2m} \} + \frac{1}{M}\sum_{i=1}^{M} g^{(1)}_{2i-1,2i}}{2} \qquad (8)$$
where $g_{i,j} = G(d_{i,j})$, and $N$ and $M$ are the total numbers of sample pairs for the corresponding classes. Then, BBCL can be defined as follows:
$$L_{BBCL} = w \cdot D^{(0)} + (1 - w) \cdot D^{(1)} \qquad (9)$$
Here, $w$ represents the weight of the majority class, which can be expressed as $w = 1 - \frac{n}{n+m}$, where $n$ and $m$ denote the numbers of majority and minority class samples in the mini-batch, respectively. Combining Equations (6) and (9), the total loss of the convolutional neural network can be written as
$$L_{total} = (1 - \lambda) \cdot L_{softmax} + \lambda \cdot L_{BBCL} \qquad (10)$$
where $\lambda$ controls the trade-off between the softmax loss and BBCL; we set $\lambda = 0.5$ in our experiments.
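A compact sketch of BBCL under the sampling scheme of Figure 7 is shown below; the even, index-based pairing of embeddings and the tensor names are assumptions.

```python
import torch

def bbcl(feat_normal, feat_press, beta=0.05):
    """feat_normal: (2n, 256) embeddings of normal samples;
    feat_press:  (2m, 256) embeddings of line-pressing samples."""
    def class_term(feats):
        a, b = feats[0::2], feats[1::2]                 # index-based pairs (f_{2i-1}, f_{2i})
        d = torch.norm(a - b, dim=1)                    # Euclidean distances d_{i,j}
        g = 1.0 - torch.exp(-beta * d ** 2)             # Gaussian mapping, Equation (7)
        return 0.5 * (g.max() + g.mean())               # hardest pair + average, Equation (8)
    n, m = feat_normal.size(0) // 2, feat_press.size(0) // 2
    w = 1.0 - n / (n + m)                               # weight of the majority (normal) class
    return w * class_term(feat_normal) + (1.0 - w) * class_term(feat_press)  # Equation (9)
```

In the total loss of Equation (10), this term is combined with the softmax loss using λ = 0.5, as in the training-step sketch of Section 4.1.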

5. Experiments

5.1. Implementation Details

For the VLPI-RC dataset, the entire dataset is divided into training, validation, and test sets in a ratio of 4:2:4. The introduction of a validation set helps to prevent model overfitting. Additionally, within each subset, the ratio of normal samples to line-pressing samples is maintained, as shown in Table 2. The input vehicle images and mask images are both resized to 128 × 128.
All models are implemented in PyTorch [30]. We use the SGD [31] optimizer with momentum set to 0.9 and weight decay set to 1. Each model is trained for 50 epochs with the cosine annealing learning rate schedule (torch.optim.lr_scheduler.CosineAnnealingLR); the initial learning rate is 0.001 and decays to 0 by the end of training. We train and evaluate the model on a single GPU; the hardware consists of an Intel Xeon Gold 6326 CPU and an NVIDIA A40 GPU. The batch size is set to 128. During training, the model does not use pre-trained parameters from any dataset such as ImageNet, and during evaluation it does not use any model compression algorithm (e.g., pruning or quantization) or inference acceleration library (e.g., TensorRT). All code is based on native Python 3.8.
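The optimizer and schedule described above can be set up as in the sketch below; `model` and `train_loader` are assumed to be the network of Section 4 and a loader yielding (vehicle image, lane mask, label) batches, and the inner step is elided.

```python
import torch

def train(model, train_loader, epochs=50):
    optimizer = torch.optim.SGD(model.parameters(), lr=0.001, momentum=0.9)
    scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=epochs, eta_min=0.0)
    for _ in range(epochs):
        for vehicle_img, lane_mask, labels in train_loader:  # batch size 128, 128x128 inputs
            pass                                             # forward/backward step (see Section 4.1)
        scheduler.step()                                     # cosine decay of the learning rate toward 0
```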

5.2. Performance Metric

Vehicle line-pressing identification is a binary classification task. Through the confusion matrix, as shown in Table 3, the classification results of the model can be better summarized. Among them, TP represents the number of normal vehicles correctly classified as normal, TN represents the number of line-pressing vehicles correctly classified as line-pressing, FN represents the number of normal vehicles misclassified as line-pressing, and FP represents the number of line-pressing vehicles misclassified as normal.
$$PPV = \frac{TP}{TP + FP}, \quad NPV = \frac{TN}{TN + FN}, \quad SPE = \frac{TN}{FP + TN}, \quad SEN = \frac{TP}{TP + FN}, \quad ACC = \frac{TP + TN}{TP + FP + FN + TN},$$
$$F1 = \frac{2 \cdot SEN \cdot PPV}{SEN + PPV}, \qquad MCC = \frac{TP \cdot TN - FP \cdot FN}{\sqrt{(TP + FN)(TP + FP)(TN + FN)(TN + FP)}} \qquad (11)$$
We used eight metrics to verify the performance of our method. The first five are PPV (Positive Predictive Value), NPV (Negative Predictive Value), SPE (Specificity), SEN (Sensitivity), and ACC (Accuracy). These metrics reflect the model's performance by measuring the correctness of its predictions on positives, negatives, and overall. However, they have limitations when dealing with imbalanced data, as they can be misleading by focusing too heavily on the majority class. Given these constraints, we focus on F1 (F1-Measure), MCC (Matthews Correlation Coefficient), and AUC (Area Under the ROC Curve), as these metrics provide a more balanced evaluation: F1 combines precision and recall, MCC assesses overall classification quality even for imbalanced data, and AUC evaluates the model's discriminative ability across different thresholds, making them more suitable for our imbalanced dataset.
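For reference, the metrics of Equation (11) can be computed directly from the confusion-matrix counts of Table 3, as in the small sketch below (edge cases with zero denominators are not handled).

```python
from math import sqrt

def metrics(tp, tn, fp, fn):
    ppv = tp / (tp + fp)                        # Positive Predictive Value
    npv = tn / (tn + fn)                        # Negative Predictive Value
    spe = tn / (fp + tn)                        # Specificity
    sen = tp / (tp + fn)                        # Sensitivity
    acc = (tp + tn) / (tp + fp + fn + tn)       # Accuracy
    f1 = 2 * sen * ppv / (sen + ppv)            # F1-Measure
    mcc = (tp * tn - fp * fn) / sqrt((tp + fn) * (tp + fp) * (tn + fn) * (tn + fp))
    return {"PPV": ppv, "NPV": npv, "SPE": spe, "SEN": sen, "ACC": acc, "F1": f1, "MCC": mcc}
```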

5.3. Comparison with State-of-the-Art Methods

The quantitative comparison results of the VLPI-RC dataset are shown in Table 4, where the results for each of the four subsets are presented separately. The BrnoCompSpeed [20] sub-dataset includes multiple scenes and the largest sample size among all datasets. Compared to existing methods, our approach achieves significant performance improvements, obtaining the highest F1 of 97.17%, MCC of 0.9654, and AUC of 99.84%. These results notably surpass the performance of Zheng et al. [14], who utilized 3D bounding box estimation for chassis localization. Moreover, due to the increased sample size and diverse scenarios, methods like those of Gao et al. [12] and Wu et al. [13], which rely on 2D bounding box-based chassis localization, exhibit further performance degradation. In the BIT-Vehicle [21] sub-dataset, where the sample size is smaller and the camera is positioned directly facing the lane, the performance gap between our approach and competing methods narrows. Nevertheless, our method still achieves the best results, with a notable 3.98% improvement in F1 compared to that of Zheng et al. [14]. For the GRAM-RTM [22] sub-dataset, which also has a limited sample size and includes only a single scenario, the camera’s distant positioning results in smaller vehicle sizes. It can be observed that the method proposed by H.D et al. [9,10], which determines the vehicle line-pressing state by calculating the pixel distance between the center of the 2D box and the lane line, performs poorly. The primary reason for this is its reliance on a single threshold, which fails to effectively account for the variations in the distance between the box center and the lane line for vehicles of different positions and types. Despite these challenges, our method achieves outstanding performance, with an F1 of 95.24%, MCC of 0.9472, and AUC of 99.86%. On our private dataset, which includes data collected from various urban intersections and highways under diverse environmental conditions such as rain and nighttime, our method demonstrates strong adaptability. It achieves the best results, with an F1 of 93.16%, MCC of 0.9109, and AUC of 98.98%, surpassing existing methods by a 7.87% improvement in F1. These results highlight our method’s robustness in handling complex scenarios and environmental variations. Moreover, using only vehicle images as input yields poor results, indicating that vehicle line-pressing events cannot be effectively identified without incorporating lane line information.

5.4. Ablation Study

5.4.1. Impact of Loss Function

We conducted an ablation study on the loss functions. As shown in Table 5, compared to the softmax loss [28], the inclusion of BBCL further improves the model’s performance, validating the effectiveness of BBCL. For the hyperparameter β used in the distance mapping in Equation (7), we performed a comparison experiment and found that the model achieves the best classification performance when β = 0.05 . Additionally, as illustrated in Figure 8, we observed that the model achieves optimal performance when the weight λ for BBCL is within the range of [ 0.4 , 0.7 ] . However, further increasing the weight λ beyond this range leads to a decline in performance. Finally, as shown in Figure 9, we present the classification results using different loss functions. Compared to softmax loss [28] and focal loss [32], BBCL achieves superior performance in challenging hard samples, further confirming its advantage in mining difficult samples.

5.4.2. Impact of Attention Mechanisms

To validate the effectiveness of the proposed mask-guided attention module in capturing key regional features, we applied Grad-CAM [33] on ResNet50 for visual analysis, as shown in Figure 10. For both normal and line-pressing vehicles, the original Layer4 heatmaps exhibit diffuse activation, lacking focus on critical areas. After integrating the attention module, the heatmaps better highlight regions around lane lines and vehicles, aligning with the task objective. Additionally, as shown in Table 6, the inclusion of the attention module boosts the F1 by 2.59% over the baseline ResNet50.

5.4.3. Impact of Backbone Networks

As shown in Table 7, we evaluated our method across several backbone networks: ResNet [23], DenseNet [34], ResNeSt [35], ShuffleNet [36,37], and MobileNet [38,39,40]. Deeper networks such as ResNet101, ResNeSt200, and DenseNet201 achieve high accuracy but incur higher computational costs, which reduces FPS. For example, ResNeSt200 achieves a higher F1 of 96.81%, but its FPS drops to 21.39. In contrast, lightweight networks like ShuffleNet and MobileNet are optimized for speed, with MobileNetV1 achieving 301.23 FPS and 94.27% F1. These results demonstrate that our method can adapt to both high-performance and lightweight models, offering a balance of accuracy and efficiency based on application needs.

5.4.4. Impact of Image Size

Table 8 shows the model performance at different input sizes. ResNet50 achieves the best performance at 256 × 256 with 96.41% F1, while ShuffleNetV2 reaches 95.27% F1. However, larger input sizes increase computational costs, with ResNet50 dropping to 151.38 FPS at 256 × 256 . While larger input sizes improve performance, they also demand more computational resources. Based on the results, an input size of 128 × 128 strikes an optimal balance between accuracy and efficiency for practical deployment.

5.4.5. Impact of Data Quantity

Due to the low incidence of vehicle line-pressing and the high cost of data collection, we explored the impact of data quantity on performance, as shown in Table 9. By training the model with different proportions of the training set, using the same test set as in previous experiments, our method achieved 91.81% F1 on the test set (13,807 images) with only 20% of the training data (2758 images). These results demonstrate the method’s strong generalization ability with limited data and effective handling of data imbalance. Moreover, they highlight the method’s transferability across different scenarios, even with limited data diversity.

5.4.6. Impact of Inaccurate Bounding Boxes

Since our method depends on the outputs of object detection models, we evaluated its performance when the 2D bounding box results were inaccurate. We simulated inaccurate detections by randomly perturbing the bounding boxes, scaling them to different degrees. We conducted ten experiments for every perturbation level and averaged the results. As shown in Table 10, accuracy and F1-score declined gradually as the perturbation level increased, reflecting the reduced localization precision. At the 30% perturbation level, the bounding boxes' coverage of the targets became notably poor, representing severe misalignment. Despite this significant challenge, our model retained robust performance, achieving an F1 of 92.65%.

5.5. Identification Speed

As shown in Table 11, we compared the inference speed of existing methods on both a high-performance GPU (NVIDIA A40) and an edge computing platform (NVIDIA Jetson AGX). For methods requiring object detection, we used YOLOv5s [41] with a fixed input image size of 640 × 640 to ensure fair comparisons across experiments. For SLDNet [11], which requires vehicle segmentation results, we used Mask R-CNN [15] for object detection in accordance with the original paper. For the method of Zheng et al. [14], which requires estimating the 3D bounding box of vehicles, PGD [42] was used for object detection as specified in the original paper. When using MobileNetV1 as the backbone network, our method achieved an inference speed of 108.29 FPS on the NVIDIA A40 and 27.16 FPS on the NVIDIA Jetson AGX. Compared to existing methods, SLDNet [11] and the method of Zheng et al. [14] introduce more vehicle information, which leads to slower inference. The methods of H.D et al. [9,10], Gao et al. [12], and Wu et al. [13] are faster than ours during the vehicle line-pressing identification phase because they do not require deep learning inference. However, as shown in Table 4, these methods do not deliver superior accuracy and are not robust in more complex scenarios. Therefore, our method balances recognition accuracy and real-time inference capability.
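FPS figures of this kind are typically measured by timing the full identification pipeline after a warm-up phase; the sketch below shows one such measurement routine, where `pipeline` is a hypothetical callable wrapping detection and line-pressing classification and a CUDA device is assumed.

```python
import time
import torch

@torch.no_grad()
def measure_fps(pipeline, frames, warmup=20):
    for f in frames[:warmup]:          # warm-up iterations excluded from timing
        pipeline(f)
    torch.cuda.synchronize()
    start = time.perf_counter()
    for f in frames[warmup:]:
        pipeline(f)
    torch.cuda.synchronize()           # wait for all queued GPU work before stopping the clock
    return (len(frames) - warmup) / (time.perf_counter() - start)
```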

6. Conclusions and Discussion

In this paper, we proposed an innovative approach for vehicle line-pressing identification that integrates vehicle and lane line features, enabling end-to-end processing and enhancing adaptability to complex environments through automated feature learning. To further improve the model’s ability to focus on key areas, we introduced a mask-guided attention mechanism, leveraging lane line masks as prior information to better capture the interaction between vehicle and lane line features. BBCL addresses the data imbalance issue and incorporates a hard example mining strategy to enhance feature discrimination. Furthermore, we constructed the VLPI-RC dataset, which includes diverse urban traffic intersection and highway scenarios under varying environmental conditions. Together, our contributions represent a significant step forward in improving the accuracy and robustness of vehicle line-pressing identification systems.
In practical deployment, we adopted a cloud–edge–end architecture. Image data from the roadside cameras (end devices) are transmitted in real time to the edge computing module, which assesses the lane-pressing status of vehicles in the current scene. It then uploads events to the traffic management platform for electronic policing to record violations and issue lane change warnings to surrounding vehicles. Notably, the lane line masks in this paper are extracted offline. This approach is based on two main considerations: (1) real-time lane line extraction consumes more computational resources and fails to handle cases where the vehicle completely occludes the lane lines; and (2) the viewpoint of the roadside cameras is fixed, and the lane line positions remain unchanged over time, making real-time extraction unnecessary. However, during continuous real-world testing, we found that weather factors such as strong winds could cause slight shifts in the camera’s angle, leading to false alarms for vehicle line-pressing. Therefore, automatic correction and extraction methods for lane line masks have become a key focus of our research in the near term. These methods can prevent the resource consumption associated with real-time lane line extraction and enable timely adjustments to the mask information.

Author Contributions

Conceptualization, Y.Q., X.Q. and R.H.; methodology, Y.Q. and X.Q.; software, Y.Q.; validation, R.H., T.S. and J.S.; formal analysis, R.H.; investigation, J.S.; resources, T.S.; data curation, Y.Q. and X.Q.; writing—original draft preparation, Y.Q.; writing—review and editing, R.H. and T.S.; visualization, X.Q.; supervision, J.S.; project administration, T.S.; funding acquisition, R.H. and T.S. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the Major Science and Technology Project of Gansu Province (22ZD6GA010), the Shanghai Sailing Program (22YF1452600, 22YF1452700), and the National Natural Science Foundation of China (52402408).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data presented in this research are available on request from the corresponding author.

Acknowledgments

This research was supported by Zhaobian (Shanghai) Technology Co., Ltd. in data collection and annotation.

Conflicts of Interest

Author Xinzhou Qi was employed by the Zhaobian (Shanghai) Technology Co., Ltd. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

References

  1. Mittal, U.; Chawla, P.; Tiwari, R. EnsembleNet: A hybrid approach for vehicle detection and estimation of traffic density based on faster R-CNN and YOLO models. Neural Comput. Appl. 2023, 35, 4755–4774. [Google Scholar] [CrossRef]
  2. Jin, G.; Wang, M.; Zhang, J.; Sha, H.; Huang, J. STGNN-TTE: Travel time estimation via spatial–temporal graph neural network. Future Gener. Comput. Syst. 2022, 126, 70–81. [Google Scholar] [CrossRef]
  3. Yao, A.; Huang, M.; Qi, J.; Zhong, P. Attention mask-based network with simple color annotation for UAV vehicle re-identification. IEEE Geosci. Remote Sens. Lett. 2021, 19, 8014705. [Google Scholar] [CrossRef]
  4. Zhu, W.; Wang, Z.; Wang, X.; Hu, R.; Liu, H.; Liu, C.; Wang, C.; Li, D. A Dual Self-Attention mechanism for vehicle re-Identification. Pattern Recognit. 2023, 137, 109258. [Google Scholar] [CrossRef]
  5. Yu, L.; Du, B.; Hu, X.; Sun, L.; Han, L.; Lv, W. Deep spatio-temporal graph convolutional network for traffic accident prediction. Neurocomputing 2021, 423, 135–147. [Google Scholar] [CrossRef]
  6. Zhao, C.; Chang, X.; Xie, T.; Fujita, H.; Wu, J. Unsupervised anomaly detection based method of risk evaluation for road traffic accident. Appl. Intell. 2023, 53, 369–384. [Google Scholar] [CrossRef]
  7. 5GAA. C-V2X Use Cases and Service Level Requirements Volume II. 2023. Available online: https://5gaa.org/c-v2x-use-cases-and-service-level-requirements-volume-ii (accessed on 22 April 2005).
  8. Lee, H.; Jeong, S.; Lee, J. Robust detection system of illegal lane changes based on tracking of feature points. IET Intell. Transp. Syst. 2013, 7, 20–27. [Google Scholar] [CrossRef]
  9. HD, A.K.; Prabhakar, C. Vehicle abnormality detection and classification using model based tracking. Int. J. Adv. Res. Comput. Sci. 2017, 8, 842. [Google Scholar]
  10. Arun Kumar, H.D.; Prabhakar, C.J. Detection and Tracking of Lane Crossing Vehicles in Traffic Video for Abnormality Analysis. Int. J. Eng. Adv. Technol. 2021, 10, 1–9. [Google Scholar] [CrossRef]
  11. Zhou, Z.; Li, R.; Gao, Y.; Zhang, C.; Hei, X. SLDNet: A Branched, Spatio-Temporal Convolution Neural Network for Detecting Solid Line Driving Violation in Intelligent Transportation Systems. In Proceedings of the 2020 Information Communication Technologies Conference (ICTC), Nanjing, China, 29–31 May 2020; IEEE: Piscataway, NJ, USA, 2020; pp. 313–317. [Google Scholar]
  12. Gao, F.; Zhou, M.; Weng, L.; Lu, S. An automatic verification method for vehicle line-pressing violation based on CNN and geometric projection. J. Ambient. Intell. Humaniz. Comput. 2021, 14, 1889–1901. [Google Scholar] [CrossRef]
  13. Wu, S.; Ge, F.; Zhang, Y. A Vehicle Line-Pressing Detection Approach Based on YOLOv5 and DeepSort. In Proceedings of the 2022 IEEE 22nd International Conference on Communication Technology (ICCT), Nanjing, China, 11–14 November 2022; pp. 1745–1749. [Google Scholar] [CrossRef]
  14. Zheng, G.; Lin, J.; Qin, Y.; Tan, B. A novel vehicle line-pressing detection framework based on 3D object detection. In Proceedings of the Fourth International Conference on Signal Processing and Computer Science (SPCS 2023), Guilin, China, 25–27 August 2023; SPIE: Bellingham, WA, USA, 2023; Volume 12970, pp. 243–250. [Google Scholar]
  15. He, K.; Gkioxari, G.; Dollár, P.; Girshick, R. Mask R-CNN. In Proceedings of the 2017 IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; pp. 2961–2969. [Google Scholar] [CrossRef]
  16. Neven, D.; De Brabandere, B.; Georgoulis, S.; Proesmans, M.; Van Gool, L. Towards end-to-end lane detection: An instance segmentation approach. In Proceedings of the 2018 IEEE Intelligent Vehicles Symposium (IV), Changshu, China, 26–30 June 2018; IEEE: Piscataway, NJ, USA, 2018; pp. 286–291. [Google Scholar]
  17. Li, G.; Qiu, Y.; Yang, Y.; Li, Z.; Li, S.; Chu, W.; Green, P.; Li, S.E. Lane Change Strategies for Autonomous Vehicles: A Deep Reinforcement Learning Approach Based on Transformer. IEEE Trans. Intell. Veh. 2023, 8, 2197–2211. [Google Scholar] [CrossRef]
  18. Biparva, M.; Fernández-Llorca, D.; Gonzalo, R.I.; Tsotsos, J.K. Video Action Recognition for Lane-Change Classification and Prediction of Surrounding Vehicles. IEEE Trans. Intell. Veh. 2022, 7, 569–578. [Google Scholar] [CrossRef]
  19. Zhang, X.; Li, Y.; Zhan, R.; Chen, J.; Li, J. The Line Pressure Detection for Autonomous Vehicles Based on Deep Learning. J. Adv. Transp. 2022, 2022, 4489770. [Google Scholar] [CrossRef]
  20. Sochor, J.; Juránek, R.; Špaňhel, J.; Maršík, L.; Široký, A.; Herout, A.; Zemčík, P. Comprehensive Data Set for Automatic Single Camera Visual Speed Measurement. IEEE Trans. Intell. Transp. Syst. 2019, 20, 1633–1643. [Google Scholar] [CrossRef]
  21. Dong, Z.; Wu, Y.; Pei, M.; Jia, Y. Vehicle Type Classification Using a Semisupervised Convolutional Neural Network. IEEE Trans. Intell. Transp. Syst. 2015, 16, 2247–2256. [Google Scholar] [CrossRef]
  22. Guerrero-Gomez-Olmedo, R.; Lopez-Sastre, R.J.; Maldonado-Bascon, S.; Fernandez-Caballero, A. Vehicle Tracking by Simultaneous Detection and Viewpoint Estimation. In Proceedings of the 5th International Work-Conference on the Interplay Between Natural and Artificial Computation, IWINAC 2013, Mallorca, Spain, 10–14 June 2013; pp. 306–316. [Google Scholar]
  23. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar]
  24. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.; Polosukhin, I. Attention is all you need. In Proceedings of the 31st International Conference on Neural Information Processing Systems, NIPS’17, Red Hook, NY, USA, 4–9 December 2017; pp. 6000–6010. [Google Scholar]
  25. Wang, X.; Girshick, R.; Gupta, A.; He, K. Non-local neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 7794–7803. [Google Scholar]
  26. Hu, J.; Shen, L.; Sun, G. Squeeze-and-excitation networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 7132–7141. [Google Scholar]
  27. Fu, J.; Liu, J.; Tian, H.; Li, Y.; Bao, Y.; Fang, Z.; Lu, H. Dual attention network for scene segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 3146–3154. [Google Scholar]
  28. Williams, C.K.; Barber, D. Bayesian classification with Gaussian processes. IEEE Trans. Pattern Anal. Mach. Intell. 1998, 20, 1342–1351. [Google Scholar] [CrossRef]
  29. Qin, Y.; Yan, C.; Liu, G.; Li, Z.; Jiang, C. Pairwise Gaussian loss for convolutional neural networks. IEEE Trans. Ind. Inform. 2020, 16, 6324–6333. [Google Scholar] [CrossRef]
  30. Paszke, A.; Gross, S.; Massa, F.; Lerer, A.; Bradbury, J.; Chanan, G.; Killeen, T.; Lin, Z.; Gimelshein, N.; Antiga, L.; et al. Pytorch: An imperative style, high-performance deep learning library. In Proceedings of the Advances in Neural Information Processing Systems, Vancouver, BC, Canada, 8–14 December 2019; Volume 32. [Google Scholar]
  31. Sutskever, I.; Martens, J.; Dahl, G.; Hinton, G. On the importance of initialization and momentum in deep learning. In Proceedings of the International Conference on Machine Learning, Atlanta, GA, USA, 17–19 June 2013; pp. 1139–1147. [Google Scholar]
  32. Lin, T.Y.; Goyal, P.; Girshick, R.; He, K.; Dollár, P. Focal loss for dense object detection. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2980–2988. [Google Scholar]
  33. Selvaraju, R.R.; Cogswell, M.; Das, A.; Vedantam, R.; Parikh, D.; Batra, D. Grad-cam: Visual explanations from deep networks via gradient-based localization. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 618–626. [Google Scholar]
  34. Huang, G.; Liu, Z.; Van Der Maaten, L.; Weinberger, K.Q. Densely connected convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 4700–4708. [Google Scholar]
  35. Zhang, H.; Wu, C.; Zhang, Z.; Zhu, Y.; Lin, H.; Zhang, Z.; Sun, Y.; He, T.; Mueller, J.; Manmatha, R.; et al. Resnest: Split-attention networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 2736–2746. [Google Scholar]
  36. Zhang, X.; Zhou, X.; Lin, M.; Sun, J. ShuffleNet: An Extremely Efficient Convolutional Neural Network for Mobile Devices. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 6848–6856. [Google Scholar] [CrossRef]
  37. Ma, N.; Zhang, X.; Zheng, H.T.; Sun, J. Shufflenet v2: Practical guidelines for efficient cnn architecture design. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 116–131. [Google Scholar]
  38. Howard, A.G.; Zhu, M.; Chen, B.; Kalenichenko, D.; Wang, W.; Weyand, T.; Andreetto, M.; Adam, H. Mobilenets: Efficient convolutional neural networks for mobile vision applications. arXiv 2017, arXiv:1704.04861. [Google Scholar]
  39. Sandler, M.; Howard, A.; Zhu, M.; Zhmoginov, A.; Chen, L.C. Mobilenetv2: Inverted residuals and linear bottlenecks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 4510–4520. [Google Scholar]
  40. Howard, A.; Sandler, M.; Chu, G.; Chen, L.C.; Chen, B.; Tan, M.; Wang, W.; Zhu, Y.; Pang, R.; Vasudevan, V.; et al. Searching for mobilenetv3. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 1314–1324. [Google Scholar]
  41. Jocher, G.; Chaurasia, A.; Stoken, A.; Borovec, J.; Kwon, Y.; Michael, K.; Fang, J.; Yifu, Z.; Wong, C.; Montes, D.; et al. ultralytics/yolov5: v7.0-YOLOv5 SOTA Realtime Instance Segmentation; Zenodo: Geneva, Switzerland, 2022. [Google Scholar]
  42. Wang, T.; Xinge, Z.; Pang, J.; Lin, D. Probabilistic and geometric depth: Detecting objects in perspective. In Proceedings of the Conference on Robot Learning PMLR, Auckland, New Zealand, 14–18 December 2022; pp. 1475–1485. [Google Scholar]
Figure 1. Some hard samples from roadside cameras. (a) Vehicle line-pressing features that are not obvious; (b) large vehicles occluding lane lines; (c) normal samples close to the decision boundary; (d) line-pressing samples close to the decision boundary; (e) the impact of weather and environmental brightness.
Figure 2. Visualization of existing methods. (a) Original vehicle information. (b) Method based on 2D bounding box. (c) Method based on 3D bounding box estimation. (d) Method based on semantic segmentation.
Figure 3. Samples from the proposed VLPI-RC dataset. The top image shows the original data captured by the roadside camera, and the bottom image presents the corresponding annotations. The red box indicates line-pressing, the green box indicates normal status, and the blue mask indicates the lane line markings. (a) BrnoCompSpeed. (b) BIT-Vehicle. (c) GRAM-RTM. (d) private dataset.
Figure 4. The overall network architecture for the proposed methods.
Figure 5. Visualization of data augmentation for the vehicle line-pressing dataset. The original image displays the bounding box of the vehicle target.
Figure 6. Illustration of our mask-guided attention.
Figure 7. An illustration of the data-sampling process and the calculation of distances between samples. $f_i$ represents the features extracted by the backbone network, which are mapped to a 256-dimensional space through a fully connected layer.
Figure 8. Sensitivity analysis of weighting factor between softmax loss and BBCL.
Figure 9. Visualization results of different loss functions on the VLPI-RC dataset. The first and second rows present the results for small and large vehicles located near the decision boundary. The third row shows results for vehicles with partially occluded bodies. The fourth row illustrates results under nighttime conditions. The target vehicles are indicated by green boxes in the images. The text below indicates the true label of each vehicle and the results of the different methods, with incorrectly classified results highlighted in red.
Figure 10. Feature heatmap visualizations based on the ResNet50 model. The data are divided into two categories: (a) normal vehicles and (b) line-pressing vehicles. Each category is presented in two columns: the left column shows the feature heatmaps from Layer4 of the ResNet50, while the right column displays the feature heatmaps after applying our proposed mask-guided attention module. In the heatmaps, red indicates high activation, while blue represents low activation or background areas.
Table 1. Detailed statistics of the VLPI-RC dataset. The imbalance ratio refers to the ratio of the normal sample to the line-pressing sample.
Dataset | Total Image | Total Sample | Normal Sample | Line-Pressing Sample | Imbalance Ratio
BrnoCompSpeed [20] | 11,139 | 22,599 | 18,502 | 4097 | 4.51
BIT-Vehicle [21] | 2196 | 2324 | 1795 | 529 | 3.39
GRAM-RTM [22] | 939 | 3824 | 3438 | 386 | 8.90
Private Dataset | 4050 | 5769 | 4439 | 1330 | 3.33
Total | 18,324 | 34,516 | 28,174 | 6342 | 4.44
Table 2. The data distribution of the vehicle normal samples and line-pressing samples in the training, validation, and test sets.
Table 2. The data distribution of the vehicle normal samples and line-pressing samples in the training, validation, and test sets.
Dataset | Training (N/L) | Validation (N/L) | Test (N/L)
BrnoCompSpeed | 7400/1638 | 3701/820 | 7401/1639
BIT-Vehicle | 718/211 | 359/106 | 718/212
GRAM-RTM | 1375/154 | 688/78 | 1375/154
Private Dataset | 1775/532 | 888/266 | 1776/532
Total | 11,268/2535 | 5636/1270 | 11,270/2537
Table 3. Confusion matrix for vehicle line-pressing identification tasks.
Ground Truth \ Predicted | Normal Class | Line-Pressing Class
Normal Class | TP (True Positive) | FN (False Negative)
Line-Pressing Class | FP (False Positive) | TN (True Negative)
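For reference, the metrics reported in Table 4 can all be derived from these four counts using the standard definitions (written here in terms of the TP, FP, FN, and TN cells of Table 3):

```latex
\begin{aligned}
\mathrm{PPV} &= \frac{TP}{TP+FP}, \quad
\mathrm{NPV} = \frac{TN}{TN+FN}, \quad
\mathrm{SEN} = \frac{TP}{TP+FN}, \quad
\mathrm{SPE} = \frac{TN}{TN+FP}, \\
\mathrm{ACC} &= \frac{TP+TN}{TP+TN+FP+FN}, \qquad
F_1 = \frac{2\,\mathrm{PPV}\cdot\mathrm{SEN}}{\mathrm{PPV}+\mathrm{SEN}}, \\
\mathrm{MCC} &= \frac{TP\cdot TN - FP\cdot FN}
{\sqrt{(TP+FP)(TP+FN)(TN+FP)(TN+FN)}}.
\end{aligned}
```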
Table 4. Quantitative evaluation results of classification on the VLPI-RC dataset. All methods use the ResNet50 backbone network with input images sized 128 × 128. "Original Image" denotes using only the vehicle image as input, to simulate the classification result of a detection-only approach. Metrics marked with * are the primary evaluation metrics, selected to address the class imbalance in the dataset.
Sub-Dataset Name | Method | PPV | NPV | SPE | SEN | ACC | F1 * | MCC * | AUC *
BrnoCompSpeed [20] | Original Image | 72.03 | 89.87 | 95.58 | 51.37 | 87.57 | 59.97 | 0.5391 | 88.54
BrnoCompSpeed [20] | H.D et al. [9,10] | 25.19 | 87.72 | 58.59 | 62.97 | 59.38 | 35.98 | 0.1668 | 60.78
BrnoCompSpeed [20] | SLDNet [11] | 90.64 | 96.98 | 98.03 | 86.21 | 95.88 | 88.37 | 0.8591 | 98.13
BrnoCompSpeed [20] | Gao et al. [12] | 76.32 | 96.29 | 94.26 | 83.59 | 92.32 | 79.79 | 0.7518 | 97.51
BrnoCompSpeed [20] | Wu et al. [13] | 77.23 | 96.35 | 94.53 | 83.83 | 92.59 | 80.40 | 0.7593 | 97.84
BrnoCompSpeed [20] | Zheng et al. [14] | 86.88 | 96.91 | 97.12 | 86.03 | 95.11 | 86.45 | 0.8347 | 98.17
BrnoCompSpeed [20] | Ours | 96.91 | 99.43 | 99.31 | 97.44 | 98.97 | 97.17 | 0.9654 | 99.84
BIT-Vehicle [21] | Original Image | 71.33 | 86.54 | 94.01 | 50.47 | 84.09 | 59.12 | 0.5074 | 87.59
BIT-Vehicle [21] | H.D et al. [9,10] | 81.33 | 89.92 | 95.68 | 63.68 | 88.39 | 71.43 | 0.6503 | 79.68
BIT-Vehicle [21] | SLDNet [11] | 96.79 | 95.83 | 99.16 | 85.38 | 96.02 | 90.73 | 0.8849 | 97.84
BIT-Vehicle [21] | Gao et al. [12] | 97.42 | 96.88 | 99.30 | 89.15 | 96.99 | 93.10 | 0.9133 | 98.97
BIT-Vehicle [21] | Wu et al. [13] | 95.59 | 97.66 | 98.75 | 91.98 | 97.20 | 93.75 | 0.9198 | 98.92
BIT-Vehicle [21] | Zheng et al. [14] | 98.97 | 97.41 | 99.72 | 91.04 | 97.74 | 94.84 | 0.9353 | 98.88
BIT-Vehicle [21] | Ours | 99.05 | 99.58 | 99.72 | 98.58 | 99.46 | 98.82 | 0.9847 | 99.97
GRAM-RTM [22] | Original Image | 83.67 | 94.97 | 98.84 | 53.25 | 94.24 | 65.08 | 0.6400 | 92.62
GRAM-RTM [22] | H.D et al. [9,10] | 12.06 | 92.50 | 44.87 | 67.53 | 47.16 | 20.47 | 0.0753 | 56.20
GRAM-RTM [22] | SLDNet [11] | 85.93 | 97.27 | 98.62 | 75.32 | 96.27 | 80.28 | 0.7843 | 96.29
GRAM-RTM [22] | Gao et al. [12] | 89.51 | 98.12 | 98.91 | 83.12 | 97.32 | 86.20 | 0.8478 | 97.10
GRAM-RTM [22] | Wu et al. [13] | 92.03 | 98.06 | 99.20 | 82.47 | 97.51 | 86.99 | 0.8577 | 97.32
GRAM-RTM [22] | Zheng et al. [14] | 87.90 | 98.83 | 98.62 | 89.61 | 97.71 | 88.75 | 0.8748 | 97.71
GRAM-RTM [22] | Ours | 93.17 | 99.71 | 99.20 | 97.40 | 99.02 | 95.24 | 0.9472 | 99.86
Private Dataset | Original Image | 60.00 | 83.01 | 92.68 | 36.65 | 79.77 | 45.51 | 0.3552 | 77.96
Private Dataset | H.D et al. [9,10] | 32.85 | 87.06 | 55.69 | 72.37 | 59.53 | 45.19 | 0.2363 | 64.03
Private Dataset | SLDNet [11] | 86.20 | 91.91 | 96.57 | 71.62 | 90.81 | 78.23 | 0.7298 | 85.45
Private Dataset | Gao et al. [12] | 86.60 | 93.20 | 96.45 | 76.50 | 91.85 | 81.24 | 0.7630 | 85.74
Private Dataset | Wu et al. [13] | 87.77 | 93.32 | 96.79 | 76.88 | 92.20 | 81.96 | 0.7729 | 86.04
Private Dataset | Zheng et al. [14] | 90.51 | 94.38 | 97.47 | 80.64 | 93.59 | 85.29 | 0.8143 | 86.80
Private Dataset | Ours | 91.64 | 98.41 | 97.41 | 94.74 | 96.79 | 93.16 | 0.9109 | 98.98
Table 5. Ablation study with different loss functions.
Loss Function | ACC | F1 | MCC | AUC
Softmax [28] | 97.86 | 94.07 | 0.9278 | 99.40
Focal Loss [32] | 98.08 | 94.75 | 0.9358 | 99.66
Softmax + BBCL (β = 0.1) | 98.51 | 95.99 | 0.9508 | 99.74
Softmax + BBCL (β = 0.05) | 98.65 | 96.34 | 0.9551 | 99.75
Softmax + BBCL (β = 0.01) | 98.57 | 96.14 | 0.9526 | 99.75
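Table 5 and Figure 8 vary the weight β that balances the softmax (cross-entropy) term against BBCL. A minimal sketch of such a weighted combination is shown below; `bbcl_loss` is only a placeholder for the paper's binary balanced contrastive loss, and the additive form L = L_softmax + β · L_BBCL is assumed from the table's notation.

```python
import torch
import torch.nn.functional as F

def bbcl_loss(embeddings: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    """Placeholder for the binary balanced contrastive loss (BBCL) proposed in the paper."""
    raise NotImplementedError  # stands in for the actual contrastive term

def total_loss(logits: torch.Tensor, embeddings: torch.Tensor,
               labels: torch.Tensor, beta: float = 0.05) -> torch.Tensor:
    """Weighted sum assumed from Table 5: softmax (cross-entropy) loss + beta * BBCL."""
    ce = F.cross_entropy(logits, labels)                # softmax loss on the classifier head
    return ce + beta * bbcl_loss(embeddings, labels)    # beta = 0.05 performs best in Table 5
```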
Table 6. Ablation results of mask-guided attention.
Method | ACC | F1 | MCC | AUC
ResNet50 | 97.65 | 93.75 | 0.9235 | 99.62
ResNet50 (Mask-Guided Attention) | 98.65 | 96.34 | 0.9551 | 99.75
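The ablation in Table 6 isolates the contribution of the mask-guided attention module. As a rough illustration of how mask information can modulate backbone features, the sketch below resizes the vehicle and lane-line masks to the feature resolution and turns them into a sigmoid attention map via a 1 × 1 convolution; this specific design (1 × 1 conv, sigmoid gating, residual connection) is an assumption for illustration, not the paper's exact module.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MaskGuidedAttention(nn.Module):
    """Illustrative sketch: reweight backbone features with a mask-derived attention map."""
    def __init__(self, channels: int = 2048, mask_channels: int = 2):
        super().__init__()
        # mask_channels = 2: one vehicle mask and one lane-line mask
        self.gate = nn.Conv2d(mask_channels, channels, kernel_size=1)

    def forward(self, feats: torch.Tensor, masks: torch.Tensor) -> torch.Tensor:
        # feats: (B, C, H, W) backbone features; masks: (B, 2, H0, W0) binary masks
        masks = F.interpolate(masks, size=feats.shape[-2:], mode="nearest")
        attn = torch.sigmoid(self.gate(masks))   # spatial attention derived from the masks
        return feats + feats * attn              # emphasize mask regions, keep a residual path

# Example shapes for a 128 x 128 crop passed through ResNet50 up to Layer4
out = MaskGuidedAttention()(torch.randn(1, 2048, 4, 4), torch.rand(1, 2, 128, 128))
```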
Table 7. Comparison of the classification results of different backbone networks on the VLPI-RC dataset. The input size is 128 × 128. Params (M): number of parameters in millions. FLOPs (G): giga floating-point operations. FPS (f/s): frames per second.
Backbone | ACC | F1 | Params (M) | FLOPs (G) | FPS
ResNet18 | 98.46 | 95.86 | 11.49 | 1.24 | 269.20
ResNet34 | 98.56 | 96.12 | 21.60 | 2.45 | 190.02
ResNet50 | 98.65 | 96.34 | 26.82 | 2.97 | 154.85
ResNet101 | 98.70 | 96.46 | 45.81 | 5.40 | 94.18
DenseNet121 | 98.13 | 94.95 | 7.92 | 1.95 | 77.94
DenseNet169 | 98.36 | 95.54 | 14.19 | 2.32 | 57.58
DenseNet201 | 98.49 | 95.92 | 20.59 | 2.96 | 48.62
ResNeSt50 | 98.63 | 96.25 | 28.77 | 4.02 | 74.85
ResNeSt101 | 98.73 | 96.55 | 49.66 | 7.92 | 41.37
ResNeSt200 | 98.82 | 96.81 | 71.59 | 12.67 | 21.39
ShuffleNetV1 | 98.07 | 94.80 | 1.75 | 0.13 | 183.07
ShuffleNetV2 | 98.20 | 95.12 | 1.66 | 0.11 | 174.30
MobileNetV1 | 97.86 | 94.27 | 4.17 | 0.45 | 301.23
MobileNetV2 | 98.14 | 94.99 | 2.59 | 0.22 | 209.08
MobileNetV3 | 97.78 | 94.04 | 3.11 | 0.16 | 148.45
Table 8. Comparison of the classification results for different input image sizes. Input Size: the size to which the original image is resized before being fed to the network.
Backbone | Input Size | ACC | F1 | FLOPs (G) | FPS
ResNet50 | 32 × 32 | 97.46 | 93.20 | 0.19 | 164.53
ResNet50 | 64 × 64 | 98.17 | 95.04 | 0.74 | 160.52
ResNet50 | 128 × 128 | 98.65 | 96.34 | 2.97 | 156.65
ResNet50 | 256 × 256 | 98.67 | 96.41 | 11.86 | 151.38
ShuffleNetV2 | 32 × 32 | 97.43 | 93.00 | 0.01 | 185.08
ShuffleNetV2 | 64 × 64 | 97.82 | 94.18 | 0.03 | 184.92
ShuffleNetV2 | 128 × 128 | 98.20 | 95.12 | 0.11 | 170.99
ShuffleNetV2 | 256 × 256 | 98.27 | 95.27 | 0.43 | 164.40
Table 9. The results of our method when using a limited quantity of data. We randomly sample the training set proportionally, keeping the test set unchanged. The backbone network is based on ResNet50, with an input image size of 128 × 128. N/L represents the number of normal samples and the number of line-pressing samples in the training set.
Sample | N/L | ACC | F1 | MCC | AUC
20% | 2253/505 | 96.94 | 91.81 | 0.8994 | 99.03
40% | 4507/1012 | 97.41 | 93.04 | 0.9146 | 99.19
60% | 6760/1519 | 98.07 | 94.79 | 0.9361 | 99.57
80% | 9014/2026 | 98.44 | 95.77 | 0.9482 | 99.57
100% | 11,268/2535 | 98.65 | 96.34 | 0.9551 | 99.75
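The subsets in Table 9 keep the normal/line-pressing ratio roughly constant at every fraction. One straightforward way to draw such a stratified subsample is sketched below; the function and its inputs are hypothetical and are not the released split code.

```python
import random

def stratified_subsample(indices_by_class: dict, fraction: float, seed: int = 0) -> list:
    """Keep `fraction` of each class so the N/L ratio of the training set is preserved."""
    rng = random.Random(seed)
    subset = []
    for _, idxs in indices_by_class.items():
        k = round(len(idxs) * fraction)
        subset.extend(rng.sample(idxs, k))
    return subset

# e.g., fraction = 0.2 on an 11,268 / 2,535 split yields roughly the 20% row of Table 9
```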
Table 10. The results of our method using inaccurate bounding boxes. The backbone network is based on ResNet50, with an input image size of 128 × 128. Perturbation level represents the percentage of random scaling applied to the ground truth (GT) bounding box dimensions.
Perturbation Level | ACC | F1 | MCC | AUC
0% (GT BBox) | 98.65 | 96.34 | 0.9551 | 99.75
5% | 98.58 | 96.16 | 0.9530 | 99.75
10% | 98.45 | 95.81 | 0.9486 | 99.71
15% | 98.27 | 95.33 | 0.9427 | 99.66
20% | 97.95 | 94.44 | 0.9319 | 99.63
25% | 97.55 | 93.40 | 0.9190 | 99.53
30% | 97.26 | 92.65 | 0.9098 | 99.42
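For Table 10, each ground-truth box is randomly rescaled by up to the stated percentage to mimic imperfect detections. The helper below shows one plausible way to apply such a perturbation while keeping the box centre fixed; the exact protocol used in the paper may differ.

```python
import random

def perturb_bbox(x: float, y: float, w: float, h: float, level: float) -> tuple:
    """Randomly scale an (x, y, w, h) box by up to +/- `level` (e.g., 0.10 for 10%)."""
    sw = 1.0 + random.uniform(-level, level)   # width scaling factor
    sh = 1.0 + random.uniform(-level, level)   # height scaling factor
    new_w, new_h = w * sw, h * sh
    cx, cy = x + w / 2.0, y + h / 2.0          # keep the centre fixed while the size changes
    return cx - new_w / 2.0, cy - new_h / 2.0, new_w, new_h
```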
Table 11. Comparison of model inference speed.
Method | Object Detection Method | Avg. Time for Object Detection (ms) | Line-Pressing Identification Method | Avg. Time for Line-Pressing Identification (ms) | Total FPS
NVIDIA A40
H.D et al. [9,10] | Yolov5s | 5.7629 | Distance Calculation | 0.0013 | 173.42
SLDNet [11] | Mask R-CNN [15] | 68.0569 | ResNet34 | 3.8103 | 13.91
Gao et al. [12] | Yolov5s | 5.7626 | Chassis Pose Fitting | 0.0078 | 173.25
Wu et al. [13] | Yolov5s | 5.7611 | Chassis Pose Fitting | 0.0083 | 173.26
Zheng et al. [14] | PGD [42] | 79.5759 | Overlap Determination | 0.8299 | 12.44
Ours | Yolov5s | 5.7614 | MobileNetV1 | 3.4768 | 108.29
Ours | Yolov5m | 7.8747 | MobileNetV1 | 3.4773 | 88.10
Ours | Yolov5l | 10.3311 | MobileNetV1 | 3.4777 | 72.41
Ours | Yolov5x | 15.3634 | MobileNetV1 | 3.4745 | 53.11
NVIDIA Jetson AGX
H.D et al. [9,10] | Yolov5s | 26.3693 | Distance Calculation | 0.0073 | 37.91
SLDNet [11] | Mask R-CNN [15] | 402.1801 | ResNet34 | 11.1265 | 2.41
Gao et al. [12] | Yolov5s | 26.2754 | Chassis Pose Fitting | 0.0493 | 37.98
Wu et al. [13] | Yolov5s | 26.7673 | Chassis Pose Fitting | 0.0434 | 37.29
Zheng et al. [14] | PGD [42] | 487.7193 | Overlap Determination | 0.8299 | 2.04
Ours | Yolov5s | 26.3865 | MobileNetV1 | 10.4323 | 27.16
Ours | Yolov5m | 49.7241 | MobileNetV1 | 10.4362 | 16.62
Ours | Yolov5l | 82.3505 | MobileNetV1 | 10.4377 | 10.77
Ours | Yolov5x | 145.3109 | MobileNetV1 | 10.4335 | 6.42
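The Total FPS column is consistent with treating detection and identification as sequential per-frame stages; for example, for our Yolov5s + MobileNetV1 pipeline on the NVIDIA A40:

```latex
\text{FPS} \approx \frac{1000}{t_{\text{det}} + t_{\text{id}}}
= \frac{1000}{5.7614\,\text{ms} + 3.4768\,\text{ms}} \approx 108.2
```

which matches the reported 108.29 up to averaging of the per-frame timings.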