To thoroughly evaluate MALNet's performance, segmentation experiments were carried out on the CDM_P, CDM_H, and CDM_C data sets, followed by a detailed analysis of both quantitative and qualitative outcomes.
4.2.1. Quantitative Results and Analysis
To quantitatively analyze the extraction results of the segmentation model, this study used the OA, P, R, F1, and IOU indicators to evaluate the test set results on CDM_P, CDM_H, and CDM_C. The F1 score and IOU jointly reflect the accuracy and completeness of the segmentation results; therefore, this study focuses on comparing these two indicators.
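These indicators follow the standard pixel-wise definitions. As a minimal sketch (assuming binary masks in which 1 marks damaged-marking pixels, and ignoring zero-division edge cases), they can be computed as:

```python
import numpy as np

def seg_metrics(pred, gt):
    """Pixel-wise OA, P, R, F1, and IOU for binary masks (1 = damaged marking)."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    tp = np.sum(pred & gt)          # damaged pixels correctly detected
    fp = np.sum(pred & ~gt)         # background predicted as damaged
    fn = np.sum(~pred & gt)         # damaged pixels missed
    tn = np.sum(~pred & ~gt)        # background correctly rejected
    oa = (tp + tn) / pred.size
    p = tp / (tp + fp)
    r = tp / (tp + fn)
    f1 = 2 * p * r / (p + r)
    iou = tp / (tp + fp + fn)
    return oa, p, r, f1, iou
```

Note that, for a single confusion matrix, F1 and IOU are monotonically related (IOU = F1 / (2 - F1)), which is why the two indicators tend to rank models similarly.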
- (1) Results of the network performance test on the CDM_P data set
Table 1 presents the results of the network performance test on the CDM_P data set. The CDM_P data set serves as a publicly available resource for damaged road marking detection. Collected primarily during daylight hours, with a few instances during dusk, the data set benefits from favorable lighting conditions. The damaged road markings exhibit excellent contrast with their surroundings. Consequently, the performance of the 11 segmentation models on the CDM_P data set generally surpasses their results on the CDM_H and CDM_C data sets.
Among these models, MALNet demonstrates superior performance on the CDM_P data set, closely followed by the BiSeNet model. This observation underscores the effectiveness of both models in preserving local detail information while capturing global contextual cues. Specifically, they excel in identifying the boundaries of damaged road markings, resulting in more complete segmentation outcomes. MALNet, a lightweight damaged road marking segmentation network proposed in this study, leverages knowledge distillation. It incorporates multi-scale feature fusion and adaptive spatial attention mechanisms to enhance segmentation precision and completeness. By transferring knowledge from a teacher model to a student model through distillation strategies, MALNet achieves improved performance while minimizing computational resources. The BiSeNet model, with F1 and IOU scores reaching 83.99% and 72.39%, respectively, closely trails MALNet. Notably, BiSeNet operates as a bidirectional segmentation network, effectively utilizing spatial and contextual branches to retain spatial information and adapt to diverse scenarios and damage types.
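The specific distillation losses used by MALNet are not detailed in this section; the sketch below shows a common response-based formulation (temperature-softened KL divergence between per-pixel class distributions, in the style of Hinton et al.), purely as an illustration of how a teacher's predictions can supervise a student without changing the student's structure:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def distill_loss(student_logits, teacher_logits, T=4.0):
    """Temperature-softened KL divergence between teacher and student
    class distributions. Shapes: (num_pixels, num_classes). A higher
    temperature T exposes more of the teacher's 'dark knowledge'."""
    p_t = softmax(teacher_logits / T)
    p_s = softmax(student_logits / T)
    kl = np.sum(p_t * (np.log(p_t + 1e-12) - np.log(p_s + 1e-12)), axis=-1)
    return T * T * kl.mean()   # T^2 rescaling keeps gradient magnitudes comparable
```

In training, this term would be added to the ordinary segmentation loss on the ground-truth masks.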
However, the performance of the remaining nine models on the CDM_P data set is relatively modest. EaNet exhibits a high precision (P) value but lower R and IOU values. This discrepancy suggests an overemphasis on positive samples during segmentation, leading to inaccuracies and incompleteness. ConvNeXt and MAResUNet achieve high F1 scores but low IOU values. These models correctly identify most marking regions (high recall) but lack precise boundary matching (low precision), resulting in suboptimal IOU. LANet, which focuses on local attention, notably underperforms compared to other models. This limitation may stem from the dense micro-features inherent in damaged road markings, where fine granularity matters. SegFormer records the lowest F1 and IOU values (68.90% and 52.56%, respectively). The absence of position encoding likely hampers its ability to effectively retain spatial information. Given that damaged markings vary in shape and size across different locations, accurate spatial context is crucial. Position encoding could enhance the model’s understanding of spatial layouts.
- (2) Results of the network performance test on the CDM_H data set
The results of the network performance test on the CDM_H data set are shown in Table 2. The CDM_H data set focuses on detecting damaged road markings specifically on highways. Although the types of damage are relatively uniform, the data were collected at night, and under the reduced lighting conditions the contrast between damaged road markings and their surroundings is less pronounced. Nevertheless, the data set comprises 3113 images, allowing models to learn from ample samples. Consequently, the segmentation performance achieved on the CDM_H data set closely approximates that achieved on the publicly available CDM_P data set, which contains 980 images.
Among the evaluated models, MALNet stands out on the CDM_H data set, achieving an F1 score of 83.75% and an IOU of 72.04%. Notably, these metrics significantly surpass those of other segmentation networks: MALNet outperforms the second-ranked LinkNet by 0.85 percentage points in F1 and 1.25 percentage points in IOU. The success of LinkNet, particularly on the CDM_H data set, can be attributed to its design as a road-specific segmentation network. Leveraging deconvolution and element-wise addition operations, LinkNet achieves feature upsampling and fusion. Interestingly, when considering the student models before knowledge distillation, MALNet-18 performs comparably to BiSeNet, LinkNet, EaNet, and MAResUNet. However, through knowledge distillation, MALNet experiences significant performance gains without altering its model structure. The distilled model outperforms other models in the same category. On the other hand, LANet exhibits the lowest F1 and IOU values (71.59% and 55.75%, respectively). This outcome may be attributed to the dense micro-features inherent in damaged road markings. While LANet enhances feature representation through local attention, it may lack the necessary finesse when handling such intricate details, resulting in compromised segmentation accuracy and completeness.
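LinkNet's decoder-encoder fusion can be illustrated as follows. This is a simplified sketch: the real network uses a learned deconvolution, for which a fixed nearest-neighbor upsampling stands in here, but the element-wise addition of the skip connection (rather than channel concatenation) is the defining ingredient:

```python
import numpy as np

def upsample2x(x):
    """2x spatial upsampling; stands in for LinkNet's learned deconvolution."""
    return x.repeat(2, axis=-2).repeat(2, axis=-1)

def linknet_decode(decoder_feat, encoder_feat):
    """LinkNet-style skip connection: upsample the decoder feature map and
    fuse it with the matching encoder feature by element-wise addition,
    which keeps the channel count (and parameter cost) low."""
    up = upsample2x(decoder_feat)
    assert up.shape == encoder_feat.shape
    return up + encoder_feat
```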
- (3) Results of the network performance test on the CDM_C data set
The network performance results on the CDM_C data set are presented in Table 3. The CDM_C data set serves as a comprehensive urban damaged road marking detection data set, encompassing various scenarios such as lane markings, intersections, main roads, and side roads. The diversity of damage types within this data set poses a unique challenge. However, due to nighttime data collection under insufficient illumination, the contrast between damaged road markings and their surroundings is less pronounced. Despite these complexities, the CDM_C data set comprises 1718 images, representing different geographical regions across China, including Chongqing, Wuhan, Shanghai, and Beijing.
Compared to the CDM_H highway data set and the publicly available CDM_P data set, the overall performance of the 11 evaluated models on the CDM_C data set is not particularly promising. Several factors contribute to this outcome.
- (1) Nighttime Data Collection: The predominantly nighttime data collection introduces low image quality and increased noise. Insufficient lighting diminishes the visibility of damaged road markings, making their differentiation from the surrounding environment challenging.
- (2) Spatial Heterogeneity: The diverse sampling locations (Chongqing, Wuhan, Shanghai) introduce spatial heterogeneity. Variations in road materials, colors, and humidity across different regions impact spectral differences, potentially affecting model generalization.
- (3) Variety of Road Types: The CDM_C data set covers a wide range of road types, including urban main roads, side streets, and intersections. The multitude of damaged road marking types presents a complex scenario. However, the data set’s sample size relative to the diversity of damage types remains insufficient, limiting the model’s robustness.
Despite these challenges, MALNet consistently achieves optimal results on the CDM_C data set. This underscores MALNet’s ability to leverage multi-scale feature fusion and adaptive spatial attention mechanisms, effectively addressing segmentation complexities in intricate urban road scenes. Its robustness and generalization capabilities remain noteworthy.
- 2. Overall Analysis of Quantitative Experimental Results
The experimental results underscore the superiority of the MALNet series models across all three data sets. These models consistently maintain excellent performance even in the face of diverse damage types and complex backgrounds, demonstrating their robustness and generalization capabilities. Notably, MALNet outperforms the original student model, MALNet-18, across all evaluation metrics, particularly achieving significant improvements in the F1 score and IOU. This enhancement highlights its accuracy and resilience.
From a structural perspective, the MALNet series models consistently achieve optimal performance across all tested data sets, affirming their robust stability and broad applicability. Notably, on the publicly available CDM_P data set, MALNet-101 achieves the best results, with an overall accuracy (OA) of 99.59% and an IOU of 74.54%. This result demonstrates its exceptional ability to accurately identify correct samples and differentiate between various types of damaged road markings.
Considering the data sets, the CDM_P data set primarily comprises images captured during daylight hours, benefiting from favorable lighting conditions that create an ideal testing environment. Although the CDM_H data set predominantly consists of nighttime images with limited visibility, its substantial sample size still yields results comparable to those of the CDM_P data set, emphasizing the importance of ample samples for effective model learning and adaptation. Meanwhile, the CDM_C data set presents the most challenging conditions, including diverse urban road scenes and insufficient nighttime illumination. Nevertheless, MALNet continues to achieve outstanding performance, showcasing its adaptability in complex scenarios.
Regarding knowledge distillation, MALNet effectively leverages the knowledge from the teacher model, resulting in significant performance gains without altering the model’s structure. This technique enhances the student model’s performance while keeping the model lightweight, making it suitable for deployment in resource-constrained environments.
In summary, the MALNet series models stand out due to their balanced performance across various data sets, affirming their effectiveness and reliability in detecting damaged road markings across diverse road scenarios. Their accuracy, completeness, and robustness make them the preferred choice for semantic segmentation tasks under diverse conditions.
4.2.2. Qualitative Results and Analysis
To comprehensively showcase the performance of the MALNet model, we specifically examine three data sets: CDM_P, CDM_H, and CDM_C. By maintaining consistent loss functions and learning rates, we compare MALNet against other methods. Our analysis aims to highlight the advantages of MALNet in handling challenging features present in damaged road marking images.
- (1) Results of the network performance test on the CDM_P data set
Figure 7 illustrates the segmentation results of the CDM_P data set, showcasing commendable performance across all 11 models. This achievement likely stems from the data set’s inherent characteristics. CDM_P was meticulously collected during daylight hours, ensuring favorable illumination conditions. Consequently, the damaged road markings exhibit pronounced contrast with the surrounding environment, minimizing interference and facilitating accurate segmentation.
A comparative analysis of the segmentation outcomes reveals that MALNet consistently stands out. Its segmentation results exhibit clear boundaries and high completeness, surpassing other models. Notably, MALNet incorporates an innovative design—the multi-scale dynamic selection convolution module. This novel approach automatically adapts sub-convolution kernels based on input feature content and task demands, dynamically adjusting the receptive field. Remarkably, MALNet achieves broader decoding context information without introducing additional parameters, effectively enhancing both segmentation accuracy and completeness.
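The internal structure of the multi-scale dynamic selection convolution module is not reproduced in this section; the sketch below illustrates the general selective-kernel idea it describes, with the learned projection layers omitted for brevity: outputs of sub-kernels with different receptive fields are fused by content-dependent softmax weights, so the effective receptive field adapts to the input.

```python
import numpy as np

def softmax(x, axis=0):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def dynamic_select(branches):
    """branches: (K, C, H, W) feature maps from K sub-kernels of different
    sizes. Global average pooling yields a per-branch, per-channel
    descriptor; a softmax across the K branches produces selection
    weights, steering each channel toward the receptive field whose
    response dominates for this input. (A real module would pass the
    descriptor through learned FC layers first.)"""
    desc = branches.mean(axis=(-2, -1))                        # (K, C)
    weights = softmax(desc, axis=0)                            # softmax over branches
    return (weights[..., None, None] * branches).sum(axis=0)   # (C, H, W)
```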
- (2) Results of the network performance test on the CDM_H data set
Figure 8 showcases the segmentation results of the CDM_H data set. Notably, the depicted road markings represent typical examples of blurred markings—where the overall contour remains discernible, but the details have significantly deteriorated due to wear and tear.
Remarkably, MALNet’s segmentation results closely approach those of the teacher model, MALNet-101, successfully extracting clear and complete damaged road marking contours. This achievement can be attributed to MALNet’s innovative use of the multi-scale dynamic selection convolution module. This novel design allows MALNet to automatically adapt sub-convolution kernels based on input feature content and task requirements, dynamically adjusting the receptive field. Consequently, MALNet effectively captures a broader decoding context without introducing additional parameters, thereby enhancing segmentation accuracy and completeness.
However, BiSeNet’s segmentation results, while relatively complete, exhibit an error in the lower right corner, misclassifying non-road marking areas as damaged road markings. This discrepancy may arise from BiSeNet’s underutilization of spatial information, leading to insufficient differentiation between road markings and the background. Although BiSeNet excels in rapid inference speed, further improvements are necessary to handle complex scenarios involving blurred road markings.
Additionally, LinkNet and EaNet achieve partial segmentation in limited regions, resulting in missing segments. These models demonstrate limitations when dealing with blurred road markings. While LinkNet prioritizes lightweight and efficient design, its performance in complex scenarios requires enhancement. EaNet, on the other hand, should better leverage spatial information to improve segmentation outcomes.
Furthermore, SegFormer, MAResUNet, and LANet only partially segment the damaged road markings, failing to capture their complete details. SegFormer, characterized by its use of Transformer structures, has room for improvement in detail representation. As for MAResUNet and LANet, their designs need to better address complex damage scenarios.
- (3) Results of the network performance test on the CDM_C data set
Figure 9 illustrates typical examples of urban road marking wear. Unlike highways, urban road conditions vary significantly due to inadequate maintenance and frequent accidents. Consequently, the overall wear on urban roads is more severe than that observed on highways. In the example, the road markings have almost worn away, leaving only faint contours visible.
Remarkably, the MALNet series models continue to demonstrate outstanding segmentation performance, successfully extracting relatively complete contours of the damaged road markings. MALNet’s success can be attributed to its innovative use of the adaptive spatial and channel attention modules. These modules enhance the feature expression capacity of the segmentation branches, enabling the model to better capture spatial information related to road damage and markings. Additionally, MALNet’s adaptive fusion module dynamically adjusts the output weights of the segmentation branches, automatically selecting more suitable fusion strategies based on input images. This adaptability significantly improves segmentation robustness.
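As an illustration of these ideas (not the exact MALNet modules, whose learned convolution and FC layers are omitted here), channel attention, spatial attention, and adaptive branch fusion can be sketched as:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def channel_attention(feat):
    """Gate each channel by its global average response (feat: (C, H, W))."""
    gate = sigmoid(feat.mean(axis=(-2, -1)))       # (C,)
    return feat * gate[:, None, None]

def spatial_attention(feat):
    """Gate each pixel by the channel-averaged response at that location."""
    gate = sigmoid(feat.mean(axis=0))              # (H, W)
    return feat * gate[None, :, :]

def adaptive_fuse(branch_a, branch_b):
    """Adaptive fusion: weight the two segmentation branches by a softmax
    over their pooled activations, so the fusion strategy follows the input."""
    scores = np.array([branch_a.mean(), branch_b.mean()])
    e = np.exp(scores - scores.max())
    w = e / e.sum()
    return w[0] * branch_a + w[1] * branch_b
```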
However, BiSeNet’s segmentation results, while relatively complete, exhibit an error in the lower right corner, misclassifying non-road marking areas as damaged road markings. This discrepancy may arise from BiSeNet’s underutilization of spatial information when dealing with geometrically complex and irregularly edged road markings, leading to insufficient differentiation between damaged markings and complex backgrounds.

Moreover, LinkNet, LANet, ConvNeXt, and SegFormer achieve segmentation only in limited regions of the urban damaged road markings, resulting in missing segments. These models demonstrate limitations when handling highly worn and blurred markings. While LinkNet emphasizes lightweight and efficient design, its performance in complex scenarios requires improvement. EaNet, on the other hand, achieves relatively complete segmentation results, benefiting from its external attention mechanism. This mechanism utilizes two small, learnable, shared-memory units to effectively extract image features. Notably, this approach achieves spatially effective image segmentation, particularly when dealing with severely worn road markings.
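The external attention mechanism referred to above replaces self-attention’s quadratic pixel-to-pixel interaction with two small memory units shared across the whole data set. A sketch follows (using one common double-normalization variant; in practice the memory units are learned parameters):

```python
import numpy as np

def external_attention(feats, Mk, Mv):
    """External attention: pixels attend to small shared memory units
    instead of to each other.
    feats: (N, d) flattened pixel features; Mk, Mv: (S, d) key/value
    memory units with S << N, so the cost is linear in N."""
    attn = feats @ Mk.T                                # (N, S) similarity to memory keys
    attn = np.exp(attn - attn.max(axis=1, keepdims=True))
    attn /= attn.sum(axis=1, keepdims=True)            # softmax over memory slots
    attn /= attn.sum(axis=0, keepdims=True) + 1e-9     # double normalization over pixels
    return attn @ Mv                                   # (N, d) re-expressed features
```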
- 2. Overall Analysis of Qualitative Experimental Results
The experimental results demonstrate that the MALNet series models consistently outperform other methods across all three data sets. This performance underscores their effectiveness in handling diverse damage types and complex backgrounds. MALNet excels in segmenting all types of damaged road markings, exhibiting superior accuracy and completeness compared to other models. These findings highlight MALNet’s robustness and generalization capabilities, enabling it to adapt to different damage types and lighting conditions.
From a structural perspective, the MALNet series models dynamically adjust the receptive field using the multi-scale dynamic selection convolution module. This innovation enhances the precision and completeness of the segmentation branches. Additionally, the adaptive spatial and channel attention modules augment feature expression, allowing MALNet to better capture spatial information related to road damage and markings. Furthermore, the model dynamically adjusts the output weights of the segmentation branches, automatically selecting fusion strategies suitable for different input images. This adaptability significantly improves the overall robustness of the segmentation branches.
Regarding knowledge distillation, MALNet effectively learns from the teacher model, acquiring additional detail expression capacity through multi-level distillation. This process enhances the performance of the student model’s segmentation branches, achieving results comparable to those of the teacher model. Remarkably, MALNet achieves this with fewer parameters.
In summary, the MALNet series models stand out due to their balanced performance across various data sets, affirming their effectiveness and reliability in detecting damaged road markings across diverse road scenarios. Compared to the original student model, MALNet demonstrates improved accuracy, completeness, and robustness, making it a powerful segmentation model suitable for a wide range of practical applications.