Article

Efficient Roadside Vehicle Line-Pressing Identification in Intelligent Transportation Systems with Mask-Guided Attention

1 School of Computer Science and Technology, Tongji University, Shanghai 201804, China
2 Zhaobian (Shanghai) Technology Co., Ltd., Shanghai 201804, China
3 Key Laboratory of Road and Traffic Engineering of the Ministry of Education, Tongji University, Shanghai 201804, China
4 Department of Geography, Hong Kong Baptist University, Hong Kong 999077, China
* Author to whom correspondence should be addressed.
Sustainability 2025, 17(9), 3845; https://doi.org/10.3390/su17093845
Submission received: 17 March 2025 / Revised: 16 April 2025 / Accepted: 21 April 2025 / Published: 24 April 2025

Abstract

Vehicle line-pressing identification from a roadside perspective is a challenging task in intelligent transportation systems. Factors such as vehicle pose and environmental lighting significantly affect identification performance, and the high cost of data collection further exacerbates the problem. Existing methods struggle to achieve robust results across different scenarios. To improve the robustness of roadside vehicle line-pressing identification, we propose an efficient method. First, we construct the first large-scale vehicle line-pressing dataset based on roadside cameras (VLPI-RC). Second, we design an end-to-end convolutional neural network that integrates vehicle and lane line mask features, incorporating a mask-guided attention module to focus on key regions relevant to line-pressing events. Finally, we introduce a binary balanced contrastive loss (BBCL) to improve the model’s ability to generate more discriminative features, addressing the class imbalance issue in binary classification tasks. Experimental results demonstrate that our method achieves 98.65% accuracy and 96.34% F1 on the VLPI-RC dataset. Moreover, when integrated into the YOLOv5 object detection framework, it attains an identification speed of 108.29 FPS. These results highlight the effectiveness of our approach in accurately and efficiently detecting vehicle line-pressing behaviors.

1. Introduction

Intelligent transportation systems play an important role in daily life. As traffic control systems, they integrate artificial intelligence, Internet of Things technology, cloud computing, intelligent hardware, and the corresponding software systems. They have been applied in many different fields, such as traffic flow detection and travel time prediction [1,2], vehicle identification [3,4], and accident detection and prevention [5,6]. By processing the collected data with the analysis models and algorithms in the software system, they enable intelligent scheduling and management of urban road traffic, greatly improving the condition of urban roads and the safety of people's travel.
Vehicle line-pressing identification is a research focus in the field of intelligent transportation. The 5GAA report [7] notes that when an autonomous vehicle detects an obstacle in its current lane, it must maneuver to avoid a collision, for example by making a sudden lane change, which can in turn cause collisions with adjacent or closely following vehicles in neighboring lanes. Detecting and warning about vehicle line-pressing events using vehicle-to-everything (V2X) technology can mitigate this collision risk and thereby enhance safety. At the same time, in countries with a high density of vehicles, changing lanes by crossing solid lines is considered a violation of road traffic safety regulations. Traffic management authorities need to detect such behavior promptly, preserve the relevant image evidence, and impose penalties on violating vehicles. However, the urban road network has a complex structure and a large coverage area, which makes traditional manual inspection impractical. The motivation of this paper is therefore to design a simple and efficient method for recognizing vehicle line-pressing, aiming both to reduce the risk of collisions caused by sudden lane changes and to enhance the ability of traffic management authorities to identify violating vehicles.
Currently, several methods [8,9,10,11,12,13,14] for vehicle line-pressing identification have been proposed. However, factors such as vehicle occlusion of lane lines, weather conditions, and environmental brightness, as shown in Figure 1, still significantly impact existing methods, leading to a substantial number of misjudgments. Through our research on previous methods, we have identified two key challenges affecting this task. (1) Insufficient Data: Vehicle line-pressing events are relatively uncommon in real-world driving scenarios. Collecting such data across diverse environments, such as intersections and highways, demands considerable cost and effort. Moreover, single-view data often suffer from occlusion, making accurate annotation difficult. In existing works, for instance, only 200 image samples were used in [12] to validate the proposed method, which is insufficient to draw reliable conclusions or to demonstrate model robustness. (2) Limited Generalization: Vehicles and lane lines serve as key information for identifying line-pressing behavior, and accurately capturing the correlation between them is essential for improving model performance. Previous methods have attempted to address this by applying direct threshold-based judgments [9,10] or extracting more complex vehicle pose information [12,13,14]. However, these approaches often suffer from poor transferability when applied to varying perspectives, road types, or geographical regions, limiting their generalization in real-world deployments.
In this paper, we propose an end-to-end vehicle line-pressing identification framework based on roadside camera data. To support this task, we construct the first large-scale dataset, VLPI-RC, consisting of 18,324 images and 34,516 labeled instances. Our method fuses vehicle and lane line features, enabling a deep convolutional neural network to learn feature representations of line-crossing vehicles, thereby achieving end-to-end processing. To enhance the model’s focus on relevant regions, we introduce a mask-guided attention module that incorporates lane line masks as prior information into the attention mechanism. Furthermore, to address class imbalance and improve discrimination of hard samples, we propose the BBCL. Experimental results demonstrate that our method achieves robust and accurate line-pressing identification across diverse traffic scenarios. In summary, our major contributions lie in four aspects:
  • We propose VLPI-RC, a large-scale dataset containing diverse scenarios, enabling a more comprehensive evaluation of model performance.
  • We propose a method that integrates vehicle features with lane line features, enabling end-to-end processing for vehicle line-pressing identification. This enhances the model’s efficiency, allowing it to adapt to more complex environments through automated feature learning.
  • We introduce a mask-guided attention mechanism, which utilizes lane line masks as prior information. This allows the model to more effectively capture the relationship between vehicle features and lane line features, focusing more on the key areas of vehicle line-pressing.
  • We propose BBCL to address the data imbalance issue and introduce a hard example mining strategy in contrastive learning, helping the model generate more discriminative features.

2. Related Work

The issue of vehicles occluding lane markings poses a significant challenge for detecting vehicle line-crossing using monocular cameras. To address this problem, various methods have been proposed to effectively process information on both vehicles and lane markings. Figure 2 presents a visual representation of some of these approaches. Lee et al. [8] proposed using inverse perspective transformation (IPM) to process the image of the road surface so that the lane lines can be parallel to each other after transformation. By setting a fixed detection area and combining with a tracking algorithm, it can be determined whether the vehicle has changed lanes. This method cannot achieve timely recognition because the lane line information is not used for calculation.
H.D et al. [9,10] proposed a method based on a 2D bounding box (bbox), as shown in Figure 2b. First, the lane line information is converted into straight lines in the image coordinate system using the Canny edge detection algorithm and the standard Hough transform. The distance from the center point of the bbox to the nearest line is then calculated; if it is less than a manually set threshold, the vehicle is judged to be in the line-pressing state. However, we found that relying solely on a single threshold to determine the vehicle's line-pressing status lacks robustness and can lead to misclassification. This approach struggles to accurately capture the relationship between the vehicle and lane lines when handling various vehicle sizes and complex occlusions. SLDNet [11] performs instance segmentation of vehicles and lane lines with a convolutional neural network, as shown in Figure 2d, and feeds the segmentation results into a binary classification model. The method can adapt to different scenarios, but it still cannot solve the problem of vehicles occluding lane lines: instance segmentation struggles to extract the features of occluded parts, so only a small number of lane line features reach the classification model. At the same time, using Mask R-CNN [15] and LaneNet [16] is time-consuming and unsuitable for real-time detection tasks.
In order to achieve more accurate recognition, Gao et al. [12], Wu et al. [13], and Zheng et al. [14] proposed methods that predict the quadrilateral outline of the vehicle chassis and incorporate lane line information. Specifically, Gao et al. [12] utilized the vehicle's driving direction to obtain 2D chassis posture information based on the 2D bounding box from object detection. Building on this, Wu et al. [13] further refined the vehicle's driving direction through object tracking, thus avoiding inaccuracies in chassis information caused by manually setting the driving direction. Additionally, Zheng et al. [14] introduced 3D object detection in the image, directly obtaining the vehicle's outline information using a 3D detection method, as shown in Figure 2c. However, since these methods are primarily designed for limited scenarios and target types, they are susceptible to various influencing factors. For example, the complex and variable viewpoints of roadside cameras, as well as the inability to effectively handle large vehicles (such as buses and large semi-trailers), can lead to inaccurate predictions of the vehicle chassis, thereby degrading recognition of the vehicle's line-pressing status.
In autonomous driving, vehicle line-pressing identification is also a current research hotspot. Li et al. [17] proposed assessing driving risks in the context of lane-changing decisions for autonomous vehicles using temporal trajectory data and probabilistic models; however, this approach does not perceive the lane-changing behavior of surrounding vehicles. Biparva et al. [18] classified the lane changes of surrounding vehicles in autonomous driving using spatial and temporal information from video data. Zhang et al. [19] proposed detecting wheel lines and using them to determine whether they intersect the lane line. Compared with on-board cameras, roadside cameras are mounted higher and cannot reliably capture the wheels of vehicles; their larger field of view also means the images contain more small targets.

3. VLPI-RC Dataset

The VLPI-RC dataset consists of four sub-datasets, and the detailed information of each is provided as follows.
  • BrnoCompSpeed [20]: This dataset uses traffic cameras to collect 21 videos, each of which is about 60 min long and has a resolution of 1920 × 1080. It contains a total of seven different scenes, and the same scene is divided into three perspectives: left, middle, and right, as shown in Figure 3a. We sampled 11,139 video frames for data annotation.
  • BIT-Vehicle [21]: This dataset contains 9850 vehicle images, with resolutions of 1600 × 1200 and 1920 × 1080. A total of 2196 images were annotated through selection. As shown in Figure 3b, the camera is positioned parallel to the lane, resulting in large vehicle targets in the images.
  • GRAM-RTM [22]: This dataset was collected from three different scenes. Considering the impact of image clarity, only the M-30-HD subset is used in our work, as shown in Figure 3c. By performing uniform sampling on consecutive frames, a total of 939 images were annotated.
  • Private Datasets: We collected 4032 images at a resolution of 1920 × 1080 using multiple roadside cameras deployed in Beijing and Shanghai, China, covering diverse traffic scenarios such as highways and urban intersections. As illustrated in Figure 3d, the dataset also includes recordings under varying environmental conditions, including rainy weather and nighttime, to comprehensively evaluate the robustness of the proposed method in real-world settings.
During the annotation process of the VLPI-RC dataset, we first identified the primary regions in each image where vehicle line-pressing identification was required. Within these regions, vehicle bounding boxes, line-pressing status, and corresponding lane line masks were annotated. To ensure high annotation quality, we adopted effective strategies for resolving ambiguous or complex scenarios. When vehicles or lane markings were partially occluded, bounding boxes and lane masks were estimated using contextual cues from adjacent frames. If the line-pressing status could not be directly determined, synchronized images from different camera views within the same scene were used to ensure accurate identification. All annotators involved in this process were well trained and had extensive experience in image annotation. Furthermore, all annotations underwent a second round of quality verification to ensure consistency and accuracy across the dataset. The final annotation statistics are summarized in Table 1.

4. Method

4.1. Overview

The overall pipeline of our method is shown in Figure 4. The proposed method is mainly divided into three stages: (1) inputting the vehicle image and lane line mask for feature fusion; (2) utilizing the lane line mask information to apply attention to key areas in the fused feature map; and (3) optimizing the network parameters using softmax loss and BBCL. Here, we briefly describe the entire processing procedure. First, we apply data augmentation to the input vehicle image $I_{Src}$ and the lane mask $I_{Mask}$. The features of each image, $F_{Src}$ and $F_{Mask}$, are extracted separately through a dual-branch feature extraction module. Subsequently, the features $F_{Src}$ and $F_{Mask}$ are fused to generate the feature $F$, and the multi-scale feature map $F_l$ is extracted through a neural network (such as ResNet50 [23]). Using the lane mask feature map $M_l$ and the multi-scale feature map $F_l$, mask-guided attention is computed as $F_l = \mathrm{Attention}(M_l, F_l)$. Finally, based on the softmax loss, we introduce BBCL to ensure higher similarity for samples of the same class and lower similarity for samples of different classes. After multiple epochs of training, the network model ultimately produces the classification results for the vehicle line-pressing status.
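For clarity, the sketch below shows how one training step could compose these three stages in PyTorch. The callables `model` (assumed to return classification logits and 256-dimensional embeddings) and `bbcl` (the loss of Section 4.5), as well as the tensor names, are illustrative assumptions rather than the released implementation.

```python
import torch.nn.functional as F

def train_step(model, bbcl, optimizer, vehicle_img, lane_mask, labels, lam=0.5):
    """One optimization step: fusion + mask-guided attention in the forward pass,
    then the combined softmax/BBCL objective (see Section 4.5).
    vehicle_img: (B, 3, H, W); lane_mask: (B, 1, H, W); labels: (B,), 0 = normal, 1 = line-pressing."""
    optimizer.zero_grad()
    logits, embeddings = model(vehicle_img, lane_mask)                   # stages (1)-(2)
    loss_cls = F.cross_entropy(logits, labels)                           # softmax loss
    loss_con = bbcl(embeddings[labels == 0], embeddings[labels == 1])    # BBCL over the two classes
    loss = (1.0 - lam) * loss_cls + lam * loss_con                       # weighted total, lambda = 0.5
    loss.backward()
    optimizer.step()
    return loss.item()
```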

4.2. Robust Input Augmentation

To enhance model robustness, we applied various data augmentation methods during training to address challenges such as weather conditions, camera image quality, and bounding box accuracy, thereby improving generalization and reducing overfitting, as shown in Figure 5. Specifically, random perturbation of bounding boxes independently scales the height and width of each box by random factors drawn from [-20%, +20%], improving robustness against inaccurate detections. Horizontal flipping is applied with a 50% probability to increase orientation diversity. Brightness adjustment scales image brightness by a factor randomly sampled from [0.5, 1.5], simulating illumination changes across different times of day. Gaussian noise with a standard deviation randomly selected from [10, 50] is added to simulate sensor or compression noise. Motion blur is applied using a linear kernel with a random size of [5, 20] pixels to imitate the appearance of fast-moving objects.
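The following NumPy/OpenCV sketch shows one possible realization of these augmentations; the parameter names and the box representation (center plus width/height) are illustrative assumptions, not the authors' code.

```python
import random
import numpy as np
import cv2

def jitter_bbox(cx, cy, w, h, max_scale=0.2):
    """Randomly rescale box width and height by independent factors in [-20%, +20%]."""
    w *= 1.0 + random.uniform(-max_scale, max_scale)
    h *= 1.0 + random.uniform(-max_scale, max_scale)
    return cx, cy, w, h

def augment(img):
    """img: HxWx3 uint8 crop of the vehicle region (BGR)."""
    if random.random() < 0.5:                                 # horizontal flip, 50% probability
        img = cv2.flip(img, 1)
    img = img.astype(np.float32) * random.uniform(0.5, 1.5)   # brightness scaling
    noise = np.random.normal(0.0, random.uniform(10, 50), img.shape)  # Gaussian noise
    img = np.clip(img + noise, 0, 255).astype(np.uint8)
    ksize = random.randint(5, 20)                             # linear motion-blur kernel
    kernel = np.zeros((ksize, ksize), np.float32)
    kernel[ksize // 2, :] = 1.0 / ksize
    return cv2.filter2D(img, -1, kernel)
```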

4.3. Feature Fusion Module

To explore the correlation between the lane markings and the vehicle and to further enhance the significance of lane information in vehicle line-pressing classification, we utilize a dual-branch feature extraction pipeline, where the vehicle image $I_{Src} \in \mathbb{R}^{W \times H \times 3}$ and the lane mask $I_{Mask} \in \mathbb{R}^{W \times H \times 1}$ are input separately. Because different backbone networks preprocess the input differently, we use ResNet50 as an example. The feature extraction module mainly consists of a convolution layer, normalization layer, activation layer, and max pooling layer. Since the input data dimensions differ, the number of input channels in the convolution layers needs to be adjusted in the two branches, while ensuring consistency in the remaining structure. Through feature extraction, we obtain two feature maps, $F_{Src}, F_{Mask} \in \mathbb{R}^{C \times W \times H}$, which are combined using element-wise fusion to generate the fusion result:
$$F(c, i, j) = F_{Src}(c, i, j) + F_{Mask}(c, i, j) \qquad (1)$$
where $c \in \{1, \ldots, C\}$ represents the channel index, and $(i, j)$ is the spatial position index, with $i \in \{1, \ldots, W\}$ and $j \in \{1, \ldots, H\}$. The fusion result preserves the original information of the vehicle and the lane line, and the deep structure of the convolutional neural network can fully capture the positional correlation between the two.
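A minimal PyTorch sketch of this dual-branch stem and the element-wise fusion of Equation (1) is given below, assuming a ResNet50-style stem; the channel sizes are assumptions.

```python
import torch.nn as nn

class DualBranchStem(nn.Module):
    """Separate stems for the vehicle image (3 channels) and lane mask (1 channel),
    followed by element-wise addition as in Equation (1)."""
    def __init__(self, out_channels=64):
        super().__init__()
        def stem(in_ch):  # conv -> BN -> ReLU -> max pool, mirroring the ResNet50 stem
            return nn.Sequential(
                nn.Conv2d(in_ch, out_channels, kernel_size=7, stride=2, padding=3, bias=False),
                nn.BatchNorm2d(out_channels),
                nn.ReLU(inplace=True),
                nn.MaxPool2d(kernel_size=3, stride=2, padding=1),
            )
        self.src_branch = stem(3)    # vehicle image I_Src
        self.mask_branch = stem(1)   # lane mask I_Mask

    def forward(self, img, mask):
        f_src = self.src_branch(img)
        f_mask = self.mask_branch(mask)
        return f_src + f_mask        # element-wise fusion, Equation (1)
```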

4.4. Mask-Guided Attention Module

Attention mechanisms have significantly improved visual recognition tasks by enabling the extraction of global contextual features [24]. However, traditional approaches rely solely on internal feature relationships and often underutilize prior knowledge, limiting their ability to focus on task-relevant regions [25,26,27]. As shown in Figure 6, we address this by introducing a lane mask M as a spatial prior to guide the attention mechanism, enhancing the model’s focus on lane line regions.
In ResNet50, the feature map output by each residual block $R_l$ is denoted $F_l \in \mathbb{R}^{C_l \times W_l \times H_l}$, where $l = 1, 2, 3, 4$ indexes the four residual blocks. To integrate the mask features into the output features of each residual block, we need to adjust the size of the mask feature map to match the output size of each residual block. Through a series of convolution, normalization, and activation operations, a mask feature map $M_l \in \mathbb{R}^{C_l \times W_l \times H_l}$ with the same dimensions as the residual block's feature map is generated. We employ a $1 \times 1$ convolution kernel in these operations to efficiently transform the feature dimensions while maintaining spatial alignment. We input the generated fused feature map $F_l$ and the mask feature map $M_l$ into the mask-guided attention module.
In the mask-guided attention module, we first reshape the feature maps to $F_l, M_l \in \mathbb{R}^{C_l \times (W_l \times H_l)}$ so that the attention mechanism can be computed over the spatial dimension. Then, the mask feature map $M_l$ is used to generate the query matrix $Q$, mapping the mask features into the query space through a linear transformation:
$$Q = W_Q \cdot M_l \qquad (2)$$
where $W_Q \in \mathbb{R}^{D \times C_l}$ is the linear projection matrix that maps the channel dimension $C_l$ of the mask feature map to the query space of dimension $D$. To ensure consistency between the input and output feature map channels, we set $D = C_l$. Similarly, the fused feature map $F_l$ is used to generate the key matrix $K$ and the value matrix $V$:
$$K = W_K \cdot F_l, \qquad V = W_V \cdot F_l \qquad (3)$$
where $W_K, W_V \in \mathbb{R}^{C_l \times C_l}$. Through the above transformations, we obtain $Q, K, V \in \mathbb{R}^{C_l \times (W_l \times H_l)}$. The correlation between different positions is computed using the dot product of the query matrix $Q$ and the key matrix $K$, resulting in the attention weight $A$:
$$A = \mathrm{softmax}\!\left(\frac{Q \cdot K^{T}}{\sqrt{d}}\right) \qquad (4)$$
where $d$ represents the dimension of the query and key vectors. Scaling the dot product by $\sqrt{d}$ prevents the values from becoming too large, which could otherwise lead to excessively small gradients in the softmax function. By applying the attention weight matrix $A$ to the value matrix $V$, we obtain the new enhanced feature map $F_{en,l}$:
$$F_{en,l} = A \cdot V \qquad (5)$$
Finally, the feature map $F_{en,l} \in \mathbb{R}^{C_l \times W_l \times H_l}$ is reshaped back to its three-dimensional form. This processed feature map is then fed into the next layer of the network for further computation, allowing the hierarchical extraction of information to continue across subsequent layers.
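The sketch below gives one plausible PyTorch reading of Equations (2)-(5), with the mask features producing the query and the fused features producing the key and value; the use of 1 × 1 convolutions for the projections and the exact scaling dimension are assumptions.

```python
import torch
import torch.nn as nn

class MaskGuidedAttention(nn.Module):
    """Mask-guided attention for one residual stage: Q from the mask features,
    K and V from the fused features, as in Equations (2)-(5)."""
    def __init__(self, channels):
        super().__init__()
        self.w_q = nn.Conv1d(channels, channels, kernel_size=1, bias=False)  # W_Q, with D = C_l
        self.w_k = nn.Conv1d(channels, channels, kernel_size=1, bias=False)  # W_K
        self.w_v = nn.Conv1d(channels, channels, kernel_size=1, bias=False)  # W_V

    def forward(self, f_l, m_l):
        # f_l: fused features (B, C_l, W_l, H_l); m_l: resized mask features of the same shape.
        b, c, w, h = f_l.shape
        f, m = f_l.flatten(2), m_l.flatten(2)                            # (B, C_l, W_l*H_l)
        q, k, v = self.w_q(m), self.w_k(f), self.w_v(f)
        d = q.size(-1)                                                   # length of the dotted vectors
        attn = torch.softmax(q @ k.transpose(1, 2) / d ** 0.5, dim=-1)   # Equation (4)
        out = attn @ v                                                   # Equation (5)
        return out.view(b, c, w, h)                                      # back to (B, C_l, W_l, H_l)
```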

4.5. Learning Balanced and Discriminative Features

Softmax loss [28] is typically used for multi-class classification tasks. However, it focuses only on individual sample classification and lacks constraints between samples, leading to poor intra-class feature compactness and limited ability to handle class imbalance. It can be defined as follows:
$$L_{softmax} = -\frac{1}{N}\sum_{i=1}^{N}\sum_{j=1}^{K} y_{ij} \cdot \log\!\left(\frac{e^{z_{ij}}}{\sum_{k=1}^{K} e^{z_{ik}}}\right) \qquad (6)$$
where $N$ is the number of samples, $K$ is the number of classes, $y_{ij}$ is the one-hot encoding of the class of sample $i$ (where $y_{ij} = 1$ indicates that sample $i$ belongs to class $j$), and $z_{ij}$ represents the logit for sample $i$ in class $j$.
To overcome these limitations, we propose BBCL. Specifically, we adopt a fixed sampling ratio in each mini-batch to address the class imbalance in the data. As shown in Figure 7, we sample $2n$ instances from the normal class and $2m$ instances from the line-pressing class. The features of the samples are extracted using the backbone network and mapped to a 256-dimensional space through a fully connected layer. The data are divided into two parts based on indices to construct feature pairs $(f^{(0)}_{2n-1}, f^{(0)}_{2n})$ and $(f^{(1)}_{2m-1}, f^{(1)}_{2m})$, and the Euclidean distance between two features is calculated as $d_{i,j} = \| f_i - f_j \|_2$. Because these distances lie in $[0, +\infty)$, the loss can fluctuate considerably during computation, which affects the training stability of the model. Inspired by [29], we introduce a Gaussian distribution function to map the distances. Since our focus in binary classification is primarily on the aggregation of same-class samples, we simplify and redefine the mapping function as follows:
$$G(d) = 1 - e^{-\beta d^{2}} \qquad (7)$$
where $\beta$ is a hyperparameter that controls the smoothness of the mapping and the convergence speed of the model. According to the experimental results in Section 5.4.1, we set $\beta = 0.05$ in our experiments.
To better achieve intra-class cohesion, we introduce a hard example mining strategy. During this process, we simultaneously compute the maximum and average distances for each class of samples to ensure that the model focuses on the most difficult sample pairs during training, while also considering the overall relationships among the samples. We define this as follows:
$$D^{(0)} = \frac{\max\{ g^{(0)}_{1,2}, \ldots, g^{(0)}_{2n-1,2n} \} + \frac{1}{N}\sum_{i=1}^{N} g^{(0)}_{2i-1,2i}}{2}, \qquad D^{(1)} = \frac{\max\{ g^{(1)}_{1,2}, \ldots, g^{(1)}_{2m-1,2m} \} + \frac{1}{M}\sum_{i=1}^{M} g^{(1)}_{2i-1,2i}}{2} \qquad (8)$$
where $g_{i,j} = G(d_{i,j})$, and $N$ and $M$ are the total numbers of sample pairs for the corresponding classes. Then, BBCL can be defined as follows:
$$L_{BBCL} = w \cdot D^{(0)} + (1 - w) \cdot D^{(1)} \qquad (9)$$
Here, $w$ represents the weight of the majority class, which can be expressed as $w = 1 - \frac{n}{n+m}$, where $n$ and $m$ denote the numbers of majority and minority class samples in the mini-batch, respectively. Combining Equations (6) and (9), the total loss of the convolutional neural network can be written as
$$L_{total} = (1 - \lambda) \cdot L_{softmax} + \lambda \cdot L_{BBCL} \qquad (10)$$
where $\lambda$ controls the trade-off between the softmax loss and BBCL; we set $\lambda = 0.5$ in our experiments.
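A compact sketch of BBCL under the sampling scheme of Figure 7 is shown below; the even, index-based pairing of embeddings and the tensor names are assumptions.

```python
import torch

def bbcl(feat_normal, feat_press, beta=0.05):
    """feat_normal: (2n, 256) embeddings of normal samples;
    feat_press:  (2m, 256) embeddings of line-pressing samples."""
    def class_term(feats):
        a, b = feats[0::2], feats[1::2]                 # index-based pairs (f_{2i-1}, f_{2i})
        d = torch.norm(a - b, dim=1)                    # Euclidean distances d_{i,j}
        g = 1.0 - torch.exp(-beta * d ** 2)             # Gaussian mapping, Equation (7)
        return 0.5 * (g.max() + g.mean())               # hardest pair + average, Equation (8)
    n, m = feat_normal.size(0) // 2, feat_press.size(0) // 2
    w = 1.0 - n / (n + m)                               # weight of the majority (normal) class
    return w * class_term(feat_normal) + (1.0 - w) * class_term(feat_press)  # Equation (9)
```

In the total loss of Equation (10), this term is combined with the softmax loss using λ = 0.5, as in the training-step sketch of Section 4.1.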

5. Experiments

5.1. Implementation Details

For the VLPI-RC dataset, the entire dataset is divided into training, validation, and test sets in a ratio of 4:2:4. The introduction of a validation set helps to prevent model overfitting. Additionally, within each subset, the ratio of normal samples to line-pressing samples is maintained, as shown in Table 2. The input vehicle images and mask images are both resized to 128 × 128.
All models are implemented in PyTorch [30]. We use the SGD [31] optimizer with momentum set to 0.9 and weight decay set to 1. Each model is trained for 50 epochs with the cosine annealing learning rate schedule (torch.optim.lr_scheduler.CosineAnnealingLR); the initial learning rate is 0.001 and decays to 0 by the end of training. We train and evaluate the model on a single GPU; the hardware consists of an Intel Xeon Gold 6326 CPU and an NVIDIA A40 GPU. The batch size is set to 128. During training, the model does not use pre-trained parameters from any dataset such as ImageNet, and during evaluation it does not use any model compression algorithm (e.g., pruning or quantization) or inference acceleration library (e.g., TensorRT). All code is based on native Python 3.8.
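The optimizer and schedule described above can be set up as in the sketch below; `model` and `train_loader` are assumed to be the network of Section 4 and a loader yielding (vehicle image, lane mask, label) batches, and the inner step is elided.

```python
import torch

def train(model, train_loader, epochs=50):
    optimizer = torch.optim.SGD(model.parameters(), lr=0.001, momentum=0.9)
    scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=epochs, eta_min=0.0)
    for _ in range(epochs):
        for vehicle_img, lane_mask, labels in train_loader:  # batch size 128, 128x128 inputs
            pass                                             # forward/backward step (see Section 4.1)
        scheduler.step()                                     # cosine decay of the learning rate toward 0
```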

5.2. Performance Metric

Vehicle line-pressing identification is a binary classification task. Through the confusion matrix, as shown in Table 3, the classification results of the model can be better summarized. Among them, TP represents the number of normal vehicles correctly classified as normal, TN represents the number of line-pressing vehicles correctly classified as line-pressing, FN represents the number of normal vehicles misclassified as line-pressing, and FP represents the number of line-pressing vehicles misclassified as normal.
$$PPV = \frac{TP}{TP + FP}, \quad NPV = \frac{TN}{TN + FN}, \quad SPE = \frac{TN}{FP + TN}, \quad SEN = \frac{TP}{TP + FN}, \quad ACC = \frac{TP + TN}{TP + FP + FN + TN},$$
$$F1 = \frac{2 \cdot SEN \cdot PPV}{SEN + PPV}, \qquad MCC = \frac{TP \cdot TN - FP \cdot FN}{\sqrt{(TP + FN)(TP + FP)(TN + FN)(TN + FP)}} \qquad (11)$$
We used eight metrics to verify the performance of our method. The first five are PPV (Positive Predictive Value), NPV (Negative Predictive Value), SPE (Specificity), SEN (Sensitivity), and ACC (Accuracy). These metrics reflect the model's performance by measuring the correctness of its predictions on positives, negatives, and overall. However, they have limitations when dealing with imbalanced data, as they can be misleading by focusing too heavily on the majority class. Given these constraints, we focus on F1 (F1-Measure), MCC (Matthews Correlation Coefficient), and AUC (Area Under the ROC Curve), as these metrics provide a more balanced evaluation: F1 combines precision and recall, MCC assesses overall classification quality even for imbalanced data, and AUC evaluates the model's discriminative ability across different thresholds, making them more suitable for our imbalanced dataset.
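For reference, the metrics of Equation (11) can be computed directly from the confusion-matrix counts of Table 3, as in the small sketch below (edge cases with zero denominators are not handled).

```python
from math import sqrt

def metrics(tp, tn, fp, fn):
    ppv = tp / (tp + fp)                        # Positive Predictive Value
    npv = tn / (tn + fn)                        # Negative Predictive Value
    spe = tn / (fp + tn)                        # Specificity
    sen = tp / (tp + fn)                        # Sensitivity
    acc = (tp + tn) / (tp + fp + fn + tn)       # Accuracy
    f1 = 2 * sen * ppv / (sen + ppv)            # F1-Measure
    mcc = (tp * tn - fp * fn) / sqrt((tp + fn) * (tp + fp) * (tn + fn) * (tn + fp))
    return {"PPV": ppv, "NPV": npv, "SPE": spe, "SEN": sen, "ACC": acc, "F1": f1, "MCC": mcc}
```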

5.3. Comparison with State-of-the-Art Methods

The quantitative comparison results of the VLPI-RC dataset are shown in Table 4, where the results for each of the four subsets are presented separately. The BrnoCompSpeed [20] sub-dataset includes multiple scenes and the largest sample size among all datasets. Compared to existing methods, our approach achieves significant performance improvements, obtaining the highest F1 of 97.17%, MCC of 0.9654, and AUC of 99.84%. These results notably surpass the performance of Zheng et al. [14], who utilized 3D bounding box estimation for chassis localization. Moreover, due to the increased sample size and diverse scenarios, methods like those of Gao et al. [12] and Wu et al. [13], which rely on 2D bounding box-based chassis localization, exhibit further performance degradation. In the BIT-Vehicle [21] sub-dataset, where the sample size is smaller and the camera is positioned directly facing the lane, the performance gap between our approach and competing methods narrows. Nevertheless, our method still achieves the best results, with a notable 3.98% improvement in F1 compared to that of Zheng et al. [14]. For the GRAM-RTM [22] sub-dataset, which also has a limited sample size and includes only a single scenario, the camera’s distant positioning results in smaller vehicle sizes. It can be observed that the method proposed by H.D et al. [9,10], which determines the vehicle line-pressing state by calculating the pixel distance between the center of the 2D box and the lane line, performs poorly. The primary reason for this is its reliance on a single threshold, which fails to effectively account for the variations in the distance between the box center and the lane line for vehicles of different positions and types. Despite these challenges, our method achieves outstanding performance, with an F1 of 95.24%, MCC of 0.9472, and AUC of 99.86%. On our private dataset, which includes data collected from various urban intersections and highways under diverse environmental conditions such as rain and nighttime, our method demonstrates strong adaptability. It achieves the best results, with an F1 of 93.16%, MCC of 0.9109, and AUC of 98.98%, surpassing existing methods by a 7.87% improvement in F1. These results highlight our method’s robustness in handling complex scenarios and environmental variations. Moreover, using only vehicle images as input yields poor results, indicating that vehicle line-pressing events cannot be effectively identified without incorporating lane line information.

5.4. Ablation Study

5.4.1. Impact of Loss Function

We conducted an ablation study on the loss functions. As shown in Table 5, compared to the softmax loss [28], the inclusion of BBCL further improves the model’s performance, validating the effectiveness of BBCL. For the hyperparameter β used in the distance mapping in Equation (7), we performed a comparison experiment and found that the model achieves the best classification performance when β = 0.05 . Additionally, as illustrated in Figure 8, we observed that the model achieves optimal performance when the weight λ for BBCL is within the range of [ 0.4 , 0.7 ] . However, further increasing the weight λ beyond this range leads to a decline in performance. Finally, as shown in Figure 9, we present the classification results using different loss functions. Compared to softmax loss [28] and focal loss [32], BBCL achieves superior performance in challenging hard samples, further confirming its advantage in mining difficult samples.

5.4.2. Impact of Attention Mechanisms

To validate the effectiveness of the proposed mask-guided attention module in capturing key regional features, we applied Grad-CAM [33] on ResNet50 for visual analysis, as shown in Figure 10. For both normal and line-pressing vehicles, the original Layer4 heatmaps exhibit diffuse activation, lacking focus on critical areas. After integrating the attention module, the heatmaps better highlight regions around lane lines and vehicles, aligning with the task objective. Additionally, as shown in Table 6, the inclusion of the attention module boosts the F1 by 2.59% over the baseline ResNet50.

5.4.3. Impact of Backbone Networks

As shown in Table 7, we evaluated our method across several backbone networks: ResNet [23], DenseNet [34], ResNeSt [35], ShuffleNet [36,37], and MobileNet [38,39,40]. Deeper networks such as ResNet101, ResNeSt200, and DenseNet201 achieve high accuracy but incur higher computational costs, which reduces FPS. For example, ResNeSt200 achieves a higher F1 of 96.81%, but its FPS drops to 21.39. In contrast, lightweight networks like ShuffleNet and MobileNet are optimized for speed, with MobileNetV1 achieving 301.23 FPS and 94.27% F1. These results demonstrate that our method can adapt to both high-performance and lightweight models, offering a balance of accuracy and efficiency based on application needs.

5.4.4. Impact of Image Size

Table 8 shows the model performance at different input sizes. ResNet50 achieves the best performance at 256 × 256 with 96.41% F1, while ShuffleNetV2 reaches 95.27% F1. However, larger input sizes increase computational costs, with ResNet50 dropping to 151.38 FPS at 256 × 256 . While larger input sizes improve performance, they also demand more computational resources. Based on the results, an input size of 128 × 128 strikes an optimal balance between accuracy and efficiency for practical deployment.

5.4.5. Impact of Data Quantity

Due to the low incidence of vehicle line-pressing and the high cost of data collection, we explored the impact of data quantity on performance, as shown in Table 9. By training the model with different proportions of the training set, using the same test set as in previous experiments, our method achieved 91.81% F1 on the test set (13,807 images) with only 20% of the training data (2758 images). These results demonstrate the method’s strong generalization ability with limited data and effective handling of data imbalance. Moreover, they highlight the method’s transferability across different scenarios, even with limited data diversity.

5.4.6. Impact of Inaccurate Bounding Boxes

Since our method depends on the outputs of object detection models, we evaluated its performance when the 2D bounding box results were inaccurate. We simulated inaccurate detections by randomly perturbing the bounding boxes, scaling them to different degrees. We conducted ten experiments for every perturbation level and averaged the results. As shown in Table 10, accuracy and F1-score declined gradually as the perturbation level increased, reflecting the reduced localization precision. At the 30% perturbation level, the bounding boxes' coverage of the targets became notably poor, representing severe misalignment. Despite this significant challenge, our model retained robust performance, achieving an F1 of 92.65%.

5.5. Identification Speed

As shown in Table 11, we compared the inference speed of existing methods on both a high-performance GPU (NVIDIA A40) and an edge computing platform (NVIDIA Jetson AGX). For methods requiring object detection, we used YOLOv5s [41] with a fixed input image size of 640 × 640 to ensure fair comparisons across experiments. For SLDNet [11], which requires vehicle segmentation results, we used Mask R-CNN [15] for object detection in accordance with the original paper. For the method of Zheng et al. [14], which requires estimating the 3D bounding box of vehicles, PGD [42] was used for object detection as specified in the original paper. When using MobileNetV1 as the backbone network, our method achieved an inference speed of 108.29 FPS on the NVIDIA A40 and 27.16 FPS on the NVIDIA Jetson AGX. Compared to existing methods, SLDNet [11] and the method of Zheng et al. [14] introduce more vehicle information, which leads to slower inference. The methods of H.D et al. [9,10], Gao et al. [12], and Wu et al. [13] are faster than ours during the vehicle line-pressing identification phase because they do not require deep learning inference. However, as shown in Table 4, these methods do not deliver superior accuracy and are not robust in more complex scenarios. Therefore, our method balances recognition accuracy and real-time inference capability.
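FPS figures of this kind are typically measured by timing the full identification pipeline after a warm-up phase; the sketch below shows one such measurement routine, where `pipeline` is a hypothetical callable wrapping detection and line-pressing classification and a CUDA device is assumed.

```python
import time
import torch

@torch.no_grad()
def measure_fps(pipeline, frames, warmup=20):
    for f in frames[:warmup]:          # warm-up iterations excluded from timing
        pipeline(f)
    torch.cuda.synchronize()
    start = time.perf_counter()
    for f in frames[warmup:]:
        pipeline(f)
    torch.cuda.synchronize()           # wait for all queued GPU work before stopping the clock
    return (len(frames) - warmup) / (time.perf_counter() - start)
```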

6. Conclusions and Discussion

In this paper, we proposed an innovative approach for vehicle line-pressing identification that integrates vehicle and lane line features, enabling end-to-end processing and enhancing adaptability to complex environments through automated feature learning. To further improve the model’s ability to focus on key areas, we introduced a mask-guided attention mechanism, leveraging lane line masks as prior information to better capture the interaction between vehicle and lane line features. BBCL addresses the data imbalance issue and incorporates a hard example mining strategy to enhance feature discrimination. Furthermore, we constructed the VLPI-RC dataset, which includes diverse urban traffic intersection and highway scenarios under varying environmental conditions. Together, our contributions represent a significant step forward in improving the accuracy and robustness of vehicle line-pressing identification systems.
In practical deployment, we adopted a cloud–edge–end architecture. Image data from the roadside cameras (end devices) are transmitted in real time to the edge computing module, which assesses the lane-pressing status of vehicles in the current scene. It then uploads events to the traffic management platform for electronic policing to record violations and issue lane change warnings to surrounding vehicles. Notably, the lane line masks in this paper are extracted offline. This approach is based on two main considerations: (1) real-time lane line extraction consumes more computational resources and fails to handle cases where the vehicle completely occludes the lane lines; and (2) the viewpoint of the roadside cameras is fixed, and the lane line positions remain unchanged over time, making real-time extraction unnecessary. However, during continuous real-world testing, we found that weather factors such as strong winds could cause slight shifts in the camera’s angle, leading to false alarms for vehicle line-pressing. Therefore, automatic correction and extraction methods for lane line masks have become a key focus of our research in the near term. These methods can prevent the resource consumption associated with real-time lane line extraction and enable timely adjustments to the mask information.

Author Contributions

Conceptualization, Y.Q., X.Q. and R.H.; methodology, Y.Q. and X.Q.; software, Y.Q.; validation, R.H., T.S. and J.S.; formal analysis, R.H.; investigation, J.S.; resources, T.S.; data curation, Y.Q. and X.Q.; writing—original draft preparation, Y.Q.; writing—review and editing, R.H. and T.S.; visualization, X.Q.; supervision, J.S.; project administration, T.S.; funding acquisition, R.H. and T.S. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the Major Science and Technology Project of Gansu Province (22ZD6GA010), the Shanghai Sailing Program (22YF1452600, 22YF1452700), and the National Natural Science Foundation of China (52402408).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data presented in this research are available on request from the corresponding author.

Acknowledgments

This research was supported by Zhaobian (Shanghai) Technology Co., Ltd. in data collection and annotation.

Conflicts of Interest

Author Xinzhou Qi was employed by the Zhaobian (Shanghai) Technology Co., Ltd. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

References

  1. Mittal, U.; Chawla, P.; Tiwari, R. EnsembleNet: A hybrid approach for vehicle detection and estimation of traffic density based on faster R-CNN and YOLO models. Neural Comput. Appl. 2023, 35, 4755–4774. [Google Scholar] [CrossRef]
  2. Jin, G.; Wang, M.; Zhang, J.; Sha, H.; Huang, J. STGNN-TTE: Travel time estimation via spatial–temporal graph neural network. Future Gener. Comput. Syst. 2022, 126, 70–81. [Google Scholar] [CrossRef]
  3. Yao, A.; Huang, M.; Qi, J.; Zhong, P. Attention mask-based network with simple color annotation for UAV vehicle re-identification. IEEE Geosci. Remote Sens. Lett. 2021, 19, 8014705. [Google Scholar] [CrossRef]
  4. Zhu, W.; Wang, Z.; Wang, X.; Hu, R.; Liu, H.; Liu, C.; Wang, C.; Li, D. A Dual Self-Attention mechanism for vehicle re-Identification. Pattern Recognit. 2023, 137, 109258. [Google Scholar] [CrossRef]
  5. Yu, L.; Du, B.; Hu, X.; Sun, L.; Han, L.; Lv, W. Deep spatio-temporal graph convolutional network for traffic accident prediction. Neurocomputing 2021, 423, 135–147. [Google Scholar] [CrossRef]
  6. Zhao, C.; Chang, X.; Xie, T.; Fujita, H.; Wu, J. Unsupervised anomaly detection based method of risk evaluation for road traffic accident. Appl. Intell. 2023, 53, 369–384. [Google Scholar] [CrossRef]
  7. 5GAA. C-V2X Use Cases and Service Level Requirements Volume II. 2023. Available online: https://5gaa.org/c-v2x-use-cases-and-service-level-requirements-volume-ii (accessed on 22 April 2005).
  8. Lee, H.; Jeong, S.; Lee, J. Robust detection system of illegal lane changes based on tracking of feature points. IET Intell. Transp. Syst. 2013, 7, 20–27. [Google Scholar] [CrossRef]
  9. HD, A.K.; Prabhakar, C. Vehicle abnormality detection and classification using model based tracking. Int. J. Adv. Res. Comput. Sci. 2017, 8, 842. [Google Scholar]
  10. Arun Kumar, H.D.; Prabhakar, C.J. Detection and Tracking of Lane Crossing Vehicles in Traffic Video for Abnormality Analysis. Int. J. Eng. Adv. Technol. 2021, 10, 1–9. [Google Scholar] [CrossRef]
  11. Zhou, Z.; Li, R.; Gao, Y.; Zhang, C.; Hei, X. SLDNet: A Branched, Spatio-Temporal Convolution Neural Network for Detecting Solid Line Driving Violation in Intelligent Transportation Systems. In Proceedings of the 2020 Information Communication Technologies Conference (ICTC), Nanjing, China, 29–31 May 2020; IEEE: Piscataway, NJ, USA, 2020; pp. 313–317. [Google Scholar]
  12. Gao, F.; Zhou, M.; Weng, L.; Lu, S. An automatic verification method for vehicle line-pressing violation based on CNN and geometric projection. J. Ambient. Intell. Humaniz. Comput. 2021, 14, 1889–1901. [Google Scholar] [CrossRef]
  13. Wu, S.; Ge, F.; Zhang, Y. A Vehicle Line-Pressing Detection Approach Based on YOLOv5 and DeepSort. In Proceedings of the 2022 IEEE 22nd International Conference on Communication Technology (ICCT), Nanjing, China, 11–14 November 2022; pp. 1745–1749. [Google Scholar] [CrossRef]
  14. Zheng, G.; Lin, J.; Qin, Y.; Tan, B. A novel vehicle line-pressing detection framework based on 3D object detection. In Proceedings of the Fourth International Conference on Signal Processing and Computer Science (SPCS 2023), Guilin, China, 25–27 August 2023; SPIE: Bellingham, WA, USA, 2023; Volume 12970, pp. 243–250. [Google Scholar]
  15. He, K.; Gkioxari, G.; Dollár, P.; Girshick, R. Mask R-CNN. In Proceedings of the 2017 IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; pp. 2961–2969. [Google Scholar] [CrossRef]
  16. Neven, D.; De Brabandere, B.; Georgoulis, S.; Proesmans, M.; Van Gool, L. Towards end-to-end lane detection: An instance segmentation approach. In Proceedings of the 2018 IEEE Intelligent Vehicles Symposium (IV), Changshu, China, 26–30 June 2018; IEEE: Piscataway, NJ, USA, 2018; pp. 286–291. [Google Scholar]
  17. Li, G.; Qiu, Y.; Yang, Y.; Li, Z.; Li, S.; Chu, W.; Green, P.; Li, S.E. Lane Change Strategies for Autonomous Vehicles: A Deep Reinforcement Learning Approach Based on Transformer. IEEE Trans. Intell. Veh. 2023, 8, 2197–2211. [Google Scholar] [CrossRef]
  18. Biparva, M.; Fernández-Llorca, D.; Gonzalo, R.I.; Tsotsos, J.K. Video Action Recognition for Lane-Change Classification and Prediction of Surrounding Vehicles. IEEE Trans. Intell. Veh. 2022, 7, 569–578. [Google Scholar] [CrossRef]
  19. Zhang, X.; Li, Y.; Zhan, R.; Chen, J.; Li, J. The Line Pressure Detection for Autonomous Vehicles Based on Deep Learning. J. Adv. Transp. 2022, 2022, 4489770. [Google Scholar] [CrossRef]
  20. Sochor, J.; Juránek, R.; Špaňhel, J.; Maršík, L.; Široký, A.; Herout, A.; Zemčík, P. Comprehensive Data Set for Automatic Single Camera Visual Speed Measurement. IEEE Trans. Intell. Transp. Syst. 2019, 20, 1633–1643. [Google Scholar] [CrossRef]
  21. Dong, Z.; Wu, Y.; Pei, M.; Jia, Y. Vehicle Type Classification Using a Semisupervised Convolutional Neural Network. IEEE Trans. Intell. Transp. Syst. 2015, 16, 2247–2256. [Google Scholar] [CrossRef]
  22. Guerrero-Gomez-Olmedo, R.; Lopez-Sastre, R.J.; Maldonado-Bascon, S.; Fernandez-Caballero, A. Vehicle Tracking by Simultaneous Detection and Viewpoint Estimation. In Proceedings of the 5th International Work-Conference on the Interplay Between Natural and Artificial Computation, IWINAC 2013, Mallorca, Spain, 10–14 June 2013; pp. 306–316. [Google Scholar]
  23. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar]
  24. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.; Polosukhin, I. Attention is all you need. In Proceedings of the 31st International Conference on Neural Information Processing Systems, NIPS’17, Red Hook, NY, USA, 4–9 December 2017; pp. 6000–6010. [Google Scholar]
  25. Wang, X.; Girshick, R.; Gupta, A.; He, K. Non-local neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 7794–7803. [Google Scholar]
  26. Hu, J.; Shen, L.; Sun, G. Squeeze-and-excitation networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 7132–7141. [Google Scholar]
  27. Fu, J.; Liu, J.; Tian, H.; Li, Y.; Bao, Y.; Fang, Z.; Lu, H. Dual attention network for scene segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 3146–3154. [Google Scholar]
  28. Williams, C.K.; Barber, D. Bayesian classification with Gaussian processes. IEEE Trans. Pattern Anal. Mach. Intell. 1998, 20, 1342–1351. [Google Scholar] [CrossRef]
  29. Qin, Y.; Yan, C.; Liu, G.; Li, Z.; Jiang, C. Pairwise Gaussian loss for convolutional neural networks. IEEE Trans. Ind. Inform. 2020, 16, 6324–6333. [Google Scholar] [CrossRef]
  30. Paszke, A.; Gross, S.; Massa, F.; Lerer, A.; Bradbury, J.; Chanan, G.; Killeen, T.; Lin, Z.; Gimelshein, N.; Antiga, L.; et al. Pytorch: An imperative style, high-performance deep learning library. In Proceedings of the Advances in Neural Information Processing Systems, Vancouver, BC, Canada, 8–14 December 2019; Volume 32. [Google Scholar]
  31. Sutskever, I.; Martens, J.; Dahl, G.; Hinton, G. On the importance of initialization and momentum in deep learning. In Proceedings of the International Conference on Machine Learning, Atlanta, GA, USA, 17–19 June 2013; pp. 1139–1147. [Google Scholar]
  32. Lin, T.Y.; Goyal, P.; Girshick, R.; He, K.; Dollár, P. Focal loss for dense object detection. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2980–2988. [Google Scholar]
  33. Selvaraju, R.R.; Cogswell, M.; Das, A.; Vedantam, R.; Parikh, D.; Batra, D. Grad-cam: Visual explanations from deep networks via gradient-based localization. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 618–626. [Google Scholar]
  34. Huang, G.; Liu, Z.; Van Der Maaten, L.; Weinberger, K.Q. Densely connected convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 4700–4708. [Google Scholar]
  35. Zhang, H.; Wu, C.; Zhang, Z.; Zhu, Y.; Lin, H.; Zhang, Z.; Sun, Y.; He, T.; Mueller, J.; Manmatha, R.; et al. Resnest: Split-attention networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 2736–2746. [Google Scholar]
  36. Zhang, X.; Zhou, X.; Lin, M.; Sun, J. ShuffleNet: An Extremely Efficient Convolutional Neural Network for Mobile Devices. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 6848–6856. [Google Scholar] [CrossRef]
  37. Ma, N.; Zhang, X.; Zheng, H.T.; Sun, J. Shufflenet v2: Practical guidelines for efficient cnn architecture design. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 116–131. [Google Scholar]
  38. Howard, A.G.; Zhu, M.; Chen, B.; Kalenichenko, D.; Wang, W.; Weyand, T.; Andreetto, M.; Adam, H. Mobilenets: Efficient convolutional neural networks for mobile vision applications. arXiv 2017, arXiv:1704.04861. [Google Scholar]
  39. Sandler, M.; Howard, A.; Zhu, M.; Zhmoginov, A.; Chen, L.C. Mobilenetv2: Inverted residuals and linear bottlenecks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 4510–4520. [Google Scholar]
  40. Howard, A.; Sandler, M.; Chu, G.; Chen, L.C.; Chen, B.; Tan, M.; Wang, W.; Zhu, Y.; Pang, R.; Vasudevan, V.; et al. Searching for mobilenetv3. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 1314–1324. [Google Scholar]
  41. Jocher, G.; Chaurasia, A.; Stoken, A.; Borovec, J.; Kwon, Y.; Michael, K.; Fang, J.; Yifu, Z.; Wong, C.; Montes, D.; et al. ultralytics/yolov5: v7.0-YOLOv5 SOTA Realtime Instance Segmentation; Zenodo: Geneva, Switzerland, 2022. [Google Scholar]
  42. Wang, T.; Xinge, Z.; Pang, J.; Lin, D. Probabilistic and geometric depth: Detecting objects in perspective. In Proceedings of the Conference on Robot Learning PMLR, Auckland, New Zealand, 14–18 December 2022; pp. 1475–1485. [Google Scholar]
Figure 1. Some hard samples from roadside cameras. (a) Vehicle line-pressing features that are not obvious; (b) large vehicles occluding lane lines; (c) normal samples close to the decision boundary; (d) line-pressing samples close to the decision boundary; (e) the impact of weather and environmental brightness.
Figure 2. Visualization of existing methods. (a) Original vehicle information. (b) Method based on 2D bounding box. (c) Method based on 3D bounding box estimation. (d) Method based on semantic segmentation.
Figure 3. Samples from the proposed VLPI-RC dataset. The top image shows the original data captured by the roadside camera, and the bottom image presents the corresponding annotations. The red box indicates line-pressing, the green box indicates normal status, and the blue mask indicates the lane line markings. (a) BrnoCompSpeed. (b) BIT-Vehicle. (c) GRAM-RTM. (d) private dataset.
Figure 4. The overall network architecture for the proposed methods.
Figure 5. Visualization of data augmentation for the vehicle line-pressing dataset. The original image displays the bounding box of the vehicle target.
Figure 6. Illustration of our mask-guided attention.
Figure 7. An illustration of the data-sampling process and the calculation of distances between samples. $f_i$ represents the features extracted by the backbone network, which are mapped to a 256-dimensional space through a fully connected layer.
Figure 8. Sensitivity analysis of weighting factor between softmax loss and BBCL.
Figure 9. Visualization results of different loss functions on the VLPI-RC dataset. The first and second rows present the results for small and large vehicles located near the decision boundary. The third row shows results for vehicles with partially occluded bodies. The fourth row illustrates results under nighttime conditions. The target vehicles are indicated by green boxes in the images. The text below indicates the true label of each vehicle and the results of the different methods, with incorrectly classified results highlighted in red.
Figure 10. Feature heatmap visualizations based on the ResNet50 model. The data are divided into two categories: (a) normal vehicles and (b) line-pressing vehicles. Each category is presented in two columns: the left column shows the feature heatmaps from Layer4 of the ResNet50, while the right column displays the feature heatmaps after applying our proposed mask-guided attention module. In the heatmaps, red indicates high activation, while blue represents low activation or background areas.
Table 1. Detailed statistics of the VLPI-RC dataset. The imbalance ratio refers to the ratio of the normal sample to the line-pressing sample.
Dataset | Total Image | Total Sample | Normal Sample | Line-Pressing Sample | Imbalance Ratio
BrnoCompSpeed [20] | 11,139 | 22,599 | 18,502 | 4097 | 4.51
BIT-Vehicle [21] | 2196 | 2324 | 1795 | 529 | 3.39
GRAM-RTM [22] | 939 | 3824 | 3438 | 386 | 8.90
Private Dataset | 4050 | 5769 | 4439 | 1330 | 3.33
Total | 18,324 | 34,516 | 28,174 | 6342 | 4.44
Table 2. The data distribution of the vehicle normal samples and line-pressing samples in the training, validation, and test sets.
Table 2. The data distribution of the vehicle normal samples and line-pressing samples in the training, validation, and test sets.
Dataset | Training (N/L) | Validation (N/L) | Test (N/L)
BrnoCompSpeed | 7400/1638 | 3701/820 | 7401/1639
BIT-Vehicle | 718/211 | 359/106 | 718/212
GRAM-RTM | 1375/154 | 688/78 | 1375/154
Private Dataset | 1775/532 | 888/266 | 1776/532
Total | 11,268/2535 | 5636/1270 | 11,270/2537
Table 3. Confusion matrix for vehicle line-pressing identification tasks.
Ground Truth \ Predicted | Normal Class | Line-Pressing Class
Normal Class | TP (True Positive) | FN (False Negative)
Line-Pressing Class | FP (False Positive) | TN (True Negative)
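For reference, the metrics reported in Table 4 can all be derived from these four counts using the standard definitions (written here in terms of the TP, FP, FN, and TN cells of Table 3):

```latex
\begin{aligned}
\mathrm{PPV} &= \frac{TP}{TP+FP}, \quad
\mathrm{NPV} = \frac{TN}{TN+FN}, \quad
\mathrm{SEN} = \frac{TP}{TP+FN}, \quad
\mathrm{SPE} = \frac{TN}{TN+FP}, \\
\mathrm{ACC} &= \frac{TP+TN}{TP+TN+FP+FN}, \qquad
F_1 = \frac{2\,\mathrm{PPV}\cdot\mathrm{SEN}}{\mathrm{PPV}+\mathrm{SEN}}, \\
\mathrm{MCC} &= \frac{TP\cdot TN - FP\cdot FN}
{\sqrt{(TP+FP)(TP+FN)(TN+FP)(TN+FN)}}.
\end{aligned}
```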
Table 4. Quantitative evaluation results of classification on the VLPI-RC dataset. All methods use the ResNet50 backbone network with input images sized 128 × 128. "Original Image" denotes using only the vehicle image as input, to simulate the classification result of a detection-only approach. Metrics marked with * are the primary evaluation metrics, selected to address the class imbalance in the dataset.
Sub-Dataset Name | Method | PPV | NPV | SPE | SEN | ACC | F1 * | MCC * | AUC *
BrnoCompSpeed [20] | Original Image | 72.03 | 89.87 | 95.58 | 51.37 | 87.57 | 59.97 | 0.5391 | 88.54
BrnoCompSpeed [20] | H.D et al. [9,10] | 25.19 | 87.72 | 58.59 | 62.97 | 59.38 | 35.98 | 0.1668 | 60.78
BrnoCompSpeed [20] | SLDNet [11] | 90.64 | 96.98 | 98.03 | 86.21 | 95.88 | 88.37 | 0.8591 | 98.13
BrnoCompSpeed [20] | Gao et al. [12] | 76.32 | 96.29 | 94.26 | 83.59 | 92.32 | 79.79 | 0.7518 | 97.51
BrnoCompSpeed [20] | Wu et al. [13] | 77.23 | 96.35 | 94.53 | 83.83 | 92.59 | 80.40 | 0.7593 | 97.84
BrnoCompSpeed [20] | Zheng et al. [14] | 86.88 | 96.91 | 97.12 | 86.03 | 95.11 | 86.45 | 0.8347 | 98.17
BrnoCompSpeed [20] | Ours | 96.91 | 99.43 | 99.31 | 97.44 | 98.97 | 97.17 | 0.9654 | 99.84
BIT-Vehicle [21] | Original Image | 71.33 | 86.54 | 94.01 | 50.47 | 84.09 | 59.12 | 0.5074 | 87.59
BIT-Vehicle [21] | H.D et al. [9,10] | 81.33 | 89.92 | 95.68 | 63.68 | 88.39 | 71.43 | 0.6503 | 79.68
BIT-Vehicle [21] | SLDNet [11] | 96.79 | 95.83 | 99.16 | 85.38 | 96.02 | 90.73 | 0.8849 | 97.84
BIT-Vehicle [21] | Gao et al. [12] | 97.42 | 96.88 | 99.30 | 89.15 | 96.99 | 93.10 | 0.9133 | 98.97
BIT-Vehicle [21] | Wu et al. [13] | 95.59 | 97.66 | 98.75 | 91.98 | 97.20 | 93.75 | 0.9198 | 98.92
BIT-Vehicle [21] | Zheng et al. [14] | 98.97 | 97.41 | 99.72 | 91.04 | 97.74 | 94.84 | 0.9353 | 98.88
BIT-Vehicle [21] | Ours | 99.05 | 99.58 | 99.72 | 98.58 | 99.46 | 98.82 | 0.9847 | 99.97
GRAM-RTM [22] | Original Image | 83.67 | 94.97 | 98.84 | 53.25 | 94.24 | 65.08 | 0.6400 | 92.62
GRAM-RTM [22] | H.D et al. [9,10] | 12.06 | 92.50 | 44.87 | 67.53 | 47.16 | 20.47 | 0.0753 | 56.20
GRAM-RTM [22] | SLDNet [11] | 85.93 | 97.27 | 98.62 | 75.32 | 96.27 | 80.28 | 0.7843 | 96.29
GRAM-RTM [22] | Gao et al. [12] | 89.51 | 98.12 | 98.91 | 83.12 | 97.32 | 86.20 | 0.8478 | 97.10
GRAM-RTM [22] | Wu et al. [13] | 92.03 | 98.06 | 99.20 | 82.47 | 97.51 | 86.99 | 0.8577 | 97.32
GRAM-RTM [22] | Zheng et al. [14] | 87.90 | 98.83 | 98.62 | 89.61 | 97.71 | 88.75 | 0.8748 | 97.71
GRAM-RTM [22] | Ours | 93.17 | 99.71 | 99.20 | 97.40 | 99.02 | 95.24 | 0.9472 | 99.86
Private Dataset | Original Image | 60.00 | 83.01 | 92.68 | 36.65 | 79.77 | 45.51 | 0.3552 | 77.96
Private Dataset | H.D et al. [9,10] | 32.85 | 87.06 | 55.69 | 72.37 | 59.53 | 45.19 | 0.2363 | 64.03
Private Dataset | SLDNet [11] | 86.20 | 91.91 | 96.57 | 71.62 | 90.81 | 78.23 | 0.7298 | 85.45
Private Dataset | Gao et al. [12] | 86.60 | 93.20 | 96.45 | 76.50 | 91.85 | 81.24 | 0.7630 | 85.74
Private Dataset | Wu et al. [13] | 87.77 | 93.32 | 96.79 | 76.88 | 92.20 | 81.96 | 0.7729 | 86.04
Private Dataset | Zheng et al. [14] | 90.51 | 94.38 | 97.47 | 80.64 | 93.59 | 85.29 | 0.8143 | 86.80
Private Dataset | Ours | 91.64 | 98.41 | 97.41 | 94.74 | 96.79 | 93.16 | 0.9109 | 98.98
Table 5. Ablation study with different loss functions.
Loss Function | ACC | F1 | MCC | AUC
Softmax [28] | 97.86 | 94.07 | 0.9278 | 99.40
Focal Loss [32] | 98.08 | 94.75 | 0.9358 | 99.66
Softmax + BBCL (β = 0.1) | 98.51 | 95.99 | 0.9508 | 99.74
Softmax + BBCL (β = 0.05) | 98.65 | 96.34 | 0.9551 | 99.75
Softmax + BBCL (β = 0.01) | 98.57 | 96.14 | 0.9526 | 99.75
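Table 5 and Figure 8 vary the weight β that balances the softmax (cross-entropy) term against BBCL. A minimal sketch of such a weighted combination is shown below; `bbcl_loss` is only a placeholder for the paper's binary balanced contrastive loss, and the additive form L = L_softmax + β · L_BBCL is assumed from the table's notation.

```python
import torch
import torch.nn.functional as F

def bbcl_loss(embeddings: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    """Placeholder for the binary balanced contrastive loss (BBCL) proposed in the paper."""
    raise NotImplementedError  # stands in for the actual contrastive term

def total_loss(logits: torch.Tensor, embeddings: torch.Tensor,
               labels: torch.Tensor, beta: float = 0.05) -> torch.Tensor:
    """Weighted sum assumed from Table 5: softmax (cross-entropy) loss + beta * BBCL."""
    ce = F.cross_entropy(logits, labels)                # softmax loss on the classifier head
    return ce + beta * bbcl_loss(embeddings, labels)    # beta = 0.05 performs best in Table 5
```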
Table 6. Ablation results of mask-guided attention.
Method | ACC | F1 | MCC | AUC
ResNet50 | 97.65 | 93.75 | 0.9235 | 99.62
ResNet50 (Mask-Guided Attention) | 98.65 | 96.34 | 0.9551 | 99.75
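The ablation in Table 6 isolates the contribution of the mask-guided attention module. As a rough illustration of how mask information can modulate backbone features, the sketch below resizes the vehicle and lane-line masks to the feature resolution and turns them into a sigmoid attention map via a 1 × 1 convolution; this specific design (1 × 1 conv, sigmoid gating, residual connection) is an assumption for illustration, not the paper's exact module.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MaskGuidedAttention(nn.Module):
    """Illustrative sketch: reweight backbone features with a mask-derived attention map."""
    def __init__(self, channels: int = 2048, mask_channels: int = 2):
        super().__init__()
        # mask_channels = 2: one vehicle mask and one lane-line mask
        self.gate = nn.Conv2d(mask_channels, channels, kernel_size=1)

    def forward(self, feats: torch.Tensor, masks: torch.Tensor) -> torch.Tensor:
        # feats: (B, C, H, W) backbone features; masks: (B, 2, H0, W0) binary masks
        masks = F.interpolate(masks, size=feats.shape[-2:], mode="nearest")
        attn = torch.sigmoid(self.gate(masks))   # spatial attention derived from the masks
        return feats + feats * attn              # emphasize mask regions, keep a residual path

# Example shapes for a 128 x 128 crop passed through ResNet50 up to Layer4
out = MaskGuidedAttention()(torch.randn(1, 2048, 4, 4), torch.rand(1, 2, 128, 128))
```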
Table 7. Comparison of the classification results of different backbone networks on the VLPI-RC dataset. The input size is 128 × 128. Params (M): number of parameters in millions. FLOPs (G): giga floating-point operations. FPS (f/s): frames per second.
Backbone | ACC | F1 | Params (M) | FLOPs (G) | FPS
ResNet18 | 98.46 | 95.86 | 11.49 | 1.24 | 269.20
ResNet34 | 98.56 | 96.12 | 21.60 | 2.45 | 190.02
ResNet50 | 98.65 | 96.34 | 26.82 | 2.97 | 154.85
ResNet101 | 98.70 | 96.46 | 45.81 | 5.40 | 94.18
DenseNet121 | 98.13 | 94.95 | 7.92 | 1.95 | 77.94
DenseNet169 | 98.36 | 95.54 | 14.19 | 2.32 | 57.58
DenseNet201 | 98.49 | 95.92 | 20.59 | 2.96 | 48.62
ResNeSt50 | 98.63 | 96.25 | 28.77 | 4.02 | 74.85
ResNeSt101 | 98.73 | 96.55 | 49.66 | 7.92 | 41.37
ResNeSt200 | 98.82 | 96.81 | 71.59 | 12.67 | 21.39
ShuffleNetV1 | 98.07 | 94.80 | 1.75 | 0.13 | 183.07
ShuffleNetV2 | 98.20 | 95.12 | 1.66 | 0.11 | 174.30
MobileNetV1 | 97.86 | 94.27 | 4.17 | 0.45 | 301.23
MobileNetV2 | 98.14 | 94.99 | 2.59 | 0.22 | 209.08
MobileNetV3 | 97.78 | 94.04 | 3.11 | 0.16 | 148.45
Table 8. Comparison of the classification results for different input image sizes. Input Size: the size to which the original image is resized before being fed to the network.
Backbone | Input Size | ACC | F1 | FLOPs (G) | FPS
ResNet50 | 32 × 32 | 97.46 | 93.20 | 0.19 | 164.53
ResNet50 | 64 × 64 | 98.17 | 95.04 | 0.74 | 160.52
ResNet50 | 128 × 128 | 98.65 | 96.34 | 2.97 | 156.65
ResNet50 | 256 × 256 | 98.67 | 96.41 | 11.86 | 151.38
ShuffleNetV2 | 32 × 32 | 97.43 | 93.00 | 0.01 | 185.08
ShuffleNetV2 | 64 × 64 | 97.82 | 94.18 | 0.03 | 184.92
ShuffleNetV2 | 128 × 128 | 98.20 | 95.12 | 0.11 | 170.99
ShuffleNetV2 | 256 × 256 | 98.27 | 95.27 | 0.43 | 164.40
Table 9. The results of our method when using a limited quantity of data. We randomly sample the training set proportionally, keeping the test set unchanged. The backbone network is based on ResNet50, with an input image size of 128 × 128. N/L represents the number of normal samples and the number of line-pressing samples in the training set.
Sample | N/L | ACC | F1 | MCC | AUC
20% | 2253/505 | 96.94 | 91.81 | 0.8994 | 99.03
40% | 4507/1012 | 97.41 | 93.04 | 0.9146 | 99.19
60% | 6760/1519 | 98.07 | 94.79 | 0.9361 | 99.57
80% | 9014/2026 | 98.44 | 95.77 | 0.9482 | 99.57
100% | 11,268/2535 | 98.65 | 96.34 | 0.9551 | 99.75
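The subsets in Table 9 keep the normal/line-pressing ratio roughly constant at every fraction. One straightforward way to draw such a stratified subsample is sketched below; the function and its inputs are hypothetical and are not the released split code.

```python
import random

def stratified_subsample(indices_by_class: dict, fraction: float, seed: int = 0) -> list:
    """Keep `fraction` of each class so the N/L ratio of the training set is preserved."""
    rng = random.Random(seed)
    subset = []
    for _, idxs in indices_by_class.items():
        k = round(len(idxs) * fraction)
        subset.extend(rng.sample(idxs, k))
    return subset

# e.g., fraction = 0.2 on an 11,268 / 2,535 split yields roughly the 20% row of Table 9
```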
Table 10. The results of our method using inaccurate bounding boxes. The backbone network is based on ResNet50, with an input image size of 128 × 128. Perturbation level represents the percentage of random scaling applied to the ground truth (GT) bounding box dimensions.
Perturbation Level | ACC | F1 | MCC | AUC
0% (GT BBox) | 98.65 | 96.34 | 0.9551 | 99.75
5% | 98.58 | 96.16 | 0.9530 | 99.75
10% | 98.45 | 95.81 | 0.9486 | 99.71
15% | 98.27 | 95.33 | 0.9427 | 99.66
20% | 97.95 | 94.44 | 0.9319 | 99.63
25% | 97.55 | 93.40 | 0.9190 | 99.53
30% | 97.26 | 92.65 | 0.9098 | 99.42
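For Table 10, each ground-truth box is randomly rescaled by up to the stated percentage to mimic imperfect detections. The helper below shows one plausible way to apply such a perturbation while keeping the box centre fixed; the exact protocol used in the paper may differ.

```python
import random

def perturb_bbox(x: float, y: float, w: float, h: float, level: float) -> tuple:
    """Randomly scale an (x, y, w, h) box by up to +/- `level` (e.g., 0.10 for 10%)."""
    sw = 1.0 + random.uniform(-level, level)   # width scaling factor
    sh = 1.0 + random.uniform(-level, level)   # height scaling factor
    new_w, new_h = w * sw, h * sh
    cx, cy = x + w / 2.0, y + h / 2.0          # keep the centre fixed while the size changes
    return cx - new_w / 2.0, cy - new_h / 2.0, new_w, new_h
```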
Table 11. Comparison of model inference speed.
Method | Object Detection Method | Avg. Time for Object Detection (ms) | Line-Pressing Identification Method | Avg. Time for Line-Pressing Identification (ms) | Total FPS
NVIDIA A40
H.D et al. [9,10] | Yolov5s | 5.7629 | Distance Calculation | 0.0013 | 173.42
SLDNet [11] | Mask R-CNN [15] | 68.0569 | ResNet34 | 3.8103 | 13.91
Gao et al. [12] | Yolov5s | 5.7626 | Chassis Pose Fitting | 0.0078 | 173.25
Wu et al. [13] | Yolov5s | 5.7611 | Chassis Pose Fitting | 0.0083 | 173.26
Zheng et al. [14] | PGD [42] | 79.5759 | Overlap Determination | 0.8299 | 12.44
Ours | Yolov5s | 5.7614 | MobileNetV1 | 3.4768 | 108.29
Ours | Yolov5m | 7.8747 | MobileNetV1 | 3.4773 | 88.10
Ours | Yolov5l | 10.3311 | MobileNetV1 | 3.4777 | 72.41
Ours | Yolov5x | 15.3634 | MobileNetV1 | 3.4745 | 53.11
NVIDIA Jetson AGX
H.D et al. [9,10] | Yolov5s | 26.3693 | Distance Calculation | 0.0073 | 37.91
SLDNet [11] | Mask R-CNN [15] | 402.1801 | ResNet34 | 11.1265 | 2.41
Gao et al. [12] | Yolov5s | 26.2754 | Chassis Pose Fitting | 0.0493 | 37.98
Wu et al. [13] | Yolov5s | 26.7673 | Chassis Pose Fitting | 0.0434 | 37.29
Zheng et al. [14] | PGD [42] | 487.7193 | Overlap Determination | 0.8299 | 2.04
Ours | Yolov5s | 26.3865 | MobileNetV1 | 10.4323 | 27.16
Ours | Yolov5m | 49.7241 | MobileNetV1 | 10.4362 | 16.62
Ours | Yolov5l | 82.3505 | MobileNetV1 | 10.4377 | 10.77
Ours | Yolov5x | 145.3109 | MobileNetV1 | 10.4335 | 6.42
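The Total FPS column is consistent with treating detection and identification as sequential per-frame stages; for example, for our Yolov5s + MobileNetV1 pipeline on the NVIDIA A40:

```latex
\text{FPS} \approx \frac{1000}{t_{\text{det}} + t_{\text{id}}}
= \frac{1000}{5.7614\,\text{ms} + 3.4768\,\text{ms}} \approx 108.2
```

which matches the reported 108.29 up to averaging of the per-frame timings.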