A Lightweight and High-Accuracy Model for Pavement Crack Segmentation

Yu, Yuhui; Xia, Wenjun; Zhao, Zhangyan; He, Bin

doi:10.3390/app142411632

¹ School of Transportation and Logistics Engineering, Wuhan University of Technology, Wuhan 430063, China

² China Railway Siyuan Survey and Design Group Co., Ltd., Wuhan 430063, China

* Author to whom correspondence should be addressed.

Appl. Sci. 2024, 14(24), 11632; https://doi.org/10.3390/app142411632

Submission received: 14 October 2024 / Revised: 10 December 2024 / Accepted: 11 December 2024 / Published: 12 December 2024

Abstract

Pavement cracks significantly affect road safety and longevity, making accurate crack segmentation essential for effective maintenance. Although deep learning methods have demonstrated excellent performance in this task, their large network architectures limit their applicability on resource-constrained devices. To address this challenge, this paper proposes a lightweight, fully convolutional neural network model, enhanced with spatial information. First, the backbone network structure is optimized to improve the efficiency of spatial information utilization. Second, by incorporating adaptive feature reassembly and wavelet transforms, the up-sampling and down-sampling processes are refined, enhancing the model's capacity to capture spatial information. Lastly, a dynamic combined loss function is employed during training to further improve the model's attention to crack edge details. To validate the model's performance, we trained and tested it on the Crack500 dataset and applied the trained model directly to the AsphaltCrack300 dataset. Experimental results indicate that the proposed model achieved an MIoU of 80.37% and an F1-score of 78.22% on the Crack500 dataset, representing increases of 3.08% and 5.62%, respectively, compared to EfficientNet. On the AsphaltCrack300 dataset, the model exhibited strong robustness, significantly outperforming other mainstream models. Additionally, its lightweight design provides clear advantages, making it well suited for real-world applications with limited computational resources.

Keywords:

deep learning; pavement cracks; crack segmentation; Haar wavelet transform; feature reorganization; multiple loss functions

1. Introduction

Cracks are one of the primary manifestations of pavement distress, and timely detection and repair of cracks are critical for preventing further deterioration of road conditions and ensuring traffic safety [1,2]. In practical engineering, road inspections involve a large workload, as they cover extensive road networks, and cracks often exhibit variable shapes with irregular edges. This makes efficient and high-precision crack segmentation a significant challenge.

Early methods for road crack detection primarily relied on manual inspections, which were inefficient and highly dependent on the experience of inspectors. This approach could not meet the growing demand for automation in modern road inspections. In recent years, with the rapid development of computer vision technology, high-precision automated crack detection has become feasible [3]. Crack segmentation methods based on computer vision can be broadly categorized into two types: traditional digital image processing methods and deep-learning-based approaches.

Digital image processing methods rely on prior information and manually designed feature-extraction techniques [4,5]. However, their effectiveness is often limited by the experience and domain knowledge of developers, leading to poor generalization and low resistance to interference, making them unsuitable for large-scale road inspection tasks. With the rapid advancement of Convolutional Neural Networks (CNNs) and computational power, deep-learning-based semantic segmentation methods have gradually become the mainstream approach in crack segmentation. These methods can automatically learn features, reducing reliance on prior information and achieving outstanding segmentation accuracy. Yang et al. [6] proposed the Feature Pyramid and Hierarchical Boosting Network (FPHBN) for pavement crack detection, which integrates multi-scale features through a feature pyramid and mitigates class imbalance using a reweighting strategy. Di Benedetto et al. [7] developed a crack segmentation method based on an improved UNet architecture, achieving superior performance compared to other UNet variants. Saberironaghi et al. [8] designed a lightweight network, DepthCrackNet, which incorporates a dual convolutional encoder structure and a spatial depth enhancement module, significantly improving the extraction of crack detail information. Qi et al. [9] introduced a lightweight network, GMDNet, which integrates multi-scale convolutional attention mechanisms and outperforms traditional large-parameter models in segmentation accuracy, highlighting the potential of lightweight networks in crack segmentation tasks. However, current methods still lack effective optimization for challenging regions such as crack edges and require further improvement in balancing segmentation accuracy with inference speed.

To address the specific challenges of crack segmentation, we optimized the model in three key areas: feature extraction networks, sampling processes, and loss functions. First, to maintain a lightweight architecture, we improved EfficientNet [10] by pruning and adding skip connections, enhancing the network capacity to utilize spatial information. Second, to prevent the loss of image details that occurs with traditional sampling methods, we replaced the original sampling mechanisms with Haar Wavelet Down-sampling (HWD) [11] and Carafe up-sampling [12], which better preserve image details. Finally, we employed a dynamic weighting strategy for the loss function, allowing the model to gradually focus on difficult-to-segment areas, such as crack edges.

The main contributions of this paper are summarized as follows:

A lightweight spatial information-enhanced fully convolutional neural network is proposed, achieving advanced segmentation accuracy and model capacity.
The model improves up-sampling and down-sampling processes by combining adaptive feature reassembly and wavelet transform, enhancing capacity to capture spatial information from images.
A dynamic combined loss function is introduced, enabling the model to effectively focus on the crack edge regions, thereby improving the final segmentation accuracy.

The remainder of this paper is organized as follows: Section 2 reviews related work. Section 3 presents our road crack segmentation model. Section 4 validates the effectiveness of our method through comparative and ablation experiments. Section 5 concludes the paper.

2. Related Work

In this section, we provide an overview of classical crack detection methods, semantic segmentation models, and research that combines crack detection with semantic segmentation, organized into three parts.

A. Crack Detection

Early crack detection methods include threshold algorithms [13], feature clustering algorithms [14], and random forest algorithms [15]. These methods rely on manually designed prior information to construct detection algorithms. For example, Zhou et al. utilized wavelet analysis to extract crack edges, resulting in crack profiles with greater continuity compared to traditional edge-detection algorithms. However, such methods typically perform poorly in complex backgrounds and under varying lighting conditions, exhibiting limited generalization capabilities. With the rapid advancement of deep learning, convolutional neural networks (CNNs) have become increasingly prevalent in crack detection applications. For instance, Kang et al. [16] employed Faster R-CNN for crack object detection, followed by segmentation of the cracks within the bounding boxes using a corrected tubular flow method. Although this approach achieved high crack segmentation accuracy, the segmentation algorithm still depended on prior information and lacked robustness against complex backgrounds. In recent years, end-to-end deep-learning-based segmentation models have begun to demonstrate significant potential. DeepCrack [17] is a typical example. This network effectively handles cracks of varying scales through multi-scale feature fusion and deep convolution, overcoming the limitations of traditional methods that rely on prior information design.

B. Semantic Segmentation

With the advent of fully convolutional networks (FCN), the field of semantic segmentation entered a period of rapid development, leading to a surge of improved models and innovative architectures. Classic networks such as U-Net [18] introduced skip connections that directly pass high-resolution features to the decoder, effectively improving segmentation accuracy and becoming a foundation for subsequent models. The DeepLab series [19] set a benchmark for accuracy with its use of dilated convolutions and multi-scale context capture capabilities. HRNet [20], with its unique multi-resolution feature parallel processing and high-resolution preservation ability, has demonstrated exceptional performance in fine-grained segmentation tasks. However, these classic models typically involve large numbers of parameters, making them challenging to meet the demands of real-time applications.

To address the limitations of embedded devices and real-time scenarios, research on lightweight networks has gradually emerged. DDRNet [21] adopts an efficient dual-branch network structure, achieving real-time inference while maintaining accuracy. Fast-SCNN [22], through hierarchical feature extraction and a lightweight decoder, is well suited for resource-constrained environments. PP-MobileSeg [23] uses depthwise separable convolutions and efficient attention mechanisms to reduce computational cost and memory requirements while maintaining high segmentation accuracy. The MobileNet series [24] further enhances lightweight performance through depthwise separable convolutions and efficient parameter design. EfficientNet [10] optimizes the trade-off between model size and accuracy with its compound scaling strategy.

In summary, research on lightweight networks has become a key development direction in modern semantic segmentation. Striking the right balance between efficiency and accuracy has become the core issue in current research.

C. Research on the Combination of Crack Detection and Semantic Segmentation

In recent years, semantic segmentation has progressively become the mainstream approach for automated road crack detection, owing to its ability to address the limitations of traditional algorithms. Qu et al. [25] proposed a deep supervised convolutional neural network for crack detection, incorporating a multi-scale convolutional feature fusion module. However, its computational efficiency is insufficient for real-time detection. Fan et al. [26] introduced a strip pooling module and an attention-based skip connection strategy, effectively suppressing irrelevant regions and enhancing the network’s ability to segment elongated cracks, yet it lacks dedicated optimization for crack edge information. Chu et al. [27] developed a multi-scale crack feature extraction network (MsCFEN), enabling effective capture of both transverse and longitudinal crack features. However, its high demand for computational and storage resources hinders deployment on mobile devices. Chen et al. [28] proposed a crack recognition method that combines feature pyramids with a memory mechanism, significantly enhancing generalization and robustness across multiple datasets, but its accuracy on public datasets still leaves room for improvement.

As the demand for real-time detection increases, lightweight crack segmentation networks have gained prominence as a research focus. Studies indicate that lightweight networks, by minimizing the impact of redundant information, demonstrate superior performance in crack segmentation tasks [8,9]. Hong et al. [29] enhanced the encoder structure of U-Net and adopted a fusion strategy with long and short skip connections, reducing network complexity effectively, although deficiencies persist in segmenting fine crack details. Zhang et al. [30] proposed a lightweight segmentation network, CrackScopeNet, incorporating multi-scale branches and strip-shaped contextual attention mechanisms to improve sensitivity to cracks of varying sizes. However, its performance is limited for cracks that blend with the background. Al-maqtari et al. [31] integrated multi-feature extraction, edge extraction, pixel expansion, attention modules, and efficient feature fusion modules, achieving significant improvements in segmentation accuracy across multiple datasets. Nonetheless, their model still struggles with fine crack recognition. Wang et al. [32] proposed the lightweight crack segmentation model C-UNet by employing a multi-scale cascade strategy and optimizing the model structure. Despite its advancements, the model exhibits poor performance in identifying fine cracks and is prone to misclassification under stain interference. Tao et al. [33] introduced DSDGNet, a weakly supervised method based on dual separation and domain generalization, which improves generalization by generating high-quality crack data and addressing the challenges of crack data annotation. However, the model’s computational efficiency still requires enhancement to meet the demands of real-time detection.

The existing methods exhibit several critical limitations: First, many models overly rely on large-scale parameters to achieve higher accuracy, which constrains their real-time applicability in practical engineering contexts. Second, lightweight methods lack sufficient optimization for crack-specific characteristics. The use of traditional up-sampling and down-sampling strategies leads to significant loss of crack detail during feature extraction and reconstruction, particularly affecting the segmentation of fine cracks. Lastly, most methods utilize cross-entropy as the loss function. Given the severe class imbalance in crack images and the intricate edge features, a single static loss function is inadequate for effectively optimizing model parameters, particularly in capturing crack boundaries and details.

3. Methods

In this section, we first introduce the overall structure of the neural network, followed by a description of the up-sampling and down-sampling methods employed by the model. Finally, we discuss the dynamic combined loss function utilized in the model.

3.1. Lightweight Fully Convolutional Neural Network

High-resolution semantic segmentation networks are capable of preserving spatial information in images and delivering precise segmentation results. However, these networks typically require multiple rounds of feature extraction and fusion for high-resolution feature maps, resulting in significantly increased parameter counts and computational demands. Additionally, such complex models are prone to overfitting when applied to specific small-scale scenarios. Previous studies have demonstrated that lightweight networks often outperform complex ones in crack segmentation tasks [9,19].

This study focuses on real-world pavement crack detection scenarios, such as road maintenance vehicle-mounted equipment or mobile terminal applications. These devices are typically constrained by limited computational capacity (e.g., ARM processors) and storage resources, necessitating the use of efficient and lightweight models.

To address these challenges, we propose a lightweight spatial-information-enhanced, fully convolutional neural network for pavement crack segmentation. The network is based on the classic fully convolutional network (FCN) architecture and uses the lightweight EfficientNet-B0 model as the backbone network. EfficientNet-B0 was chosen primarily for its compound scaling strategy, which balances parameter count, computational cost, and accuracy. Furthermore, compared to larger EfficientNet variants, B0 exhibits significantly reduced model complexity, making it more suitable for mobile devices and embedded systems. The core module of EfficientNet is the lightweight MBConv (Mobile Inverted Bottleneck Convolution). MBConv integrates depthwise separable convolutions with the SE (Squeeze-and-Excitation) channel attention mechanism, further enhancing the efficiency and performance of the model. The structure of MBConv is illustrated in Figure 1.

The original EfficientNet architecture is relatively deep, which can lead to over-abstracted target features and the loss of crucial spatial information. To address this limitation, this study reduces the number of down-sampling operations in the backbone network to four, preserving more spatial details. Over-abstraction often results in excessively smoothed segmentation boundaries, hindering the accurate detection of fine crack features—a critical aspect of crack segmentation tasks. Given that crack segmentation involves only two classes—background and target—the informational complexity is relatively low. Therefore, this study also prunes the number of blocks and feature map channels at each down-sampling stage to align the model’s capacity with the simplicity of the input data. The optimized network architecture is illustrated in Figure 2.

In the FCN network, the decoder restores the encoder’s output to a segmented image. FCN16s and FCN8s further enhance this process by incorporating the 16× and 8× down-sampled feature maps into the decoder’s up-sampling process, effectively fusing spatial and semantic information to improve segmentation accuracy. For pavement crack segmentation, enhancing the model’s ability to perceive and process spatial details is critical. Therefore, this study introduces a strategy where, at each down-sampling stage (2×, 4×, 8×, and 16×), feature maps rich in spatial information are repeatedly integrated into the corresponding up-sampling stages via skip connections. This approach facilitates a more effective fusion of deep semantic features with shallow spatial details, enhancing the model’s ability to capture intricate crack edges and preventing the loss of fine crack features during the convolutional process. While this method marginally increases computational complexity, it yields substantial improvements in segmentation performance.

3.2. Haar Wavelet Down-Sampling

Semantic segmentation models often employ strided convolution or pooling methods for down-sampling feature maps. While these approaches reduce spatial resolution, they inevitably cause the loss of critical information, particularly spatial details such as edges and textures. This issue is especially pronounced in crack segmentation tasks, where cracks are typically thin, irregular, and have low contrast with the background. The loss of fine details significantly impacts segmentation performance.

To address this challenge, we propose a down-sampling module based on Haar wavelet transform (Haar Wavelet Down-sampling, HWD). This module enables resolution reduction while preserving spatial information effectively. Haar wavelet transform is a simple yet efficient discrete wavelet transform that is well suited for image processing. Specifically, for a 2D feature map of size H × W, Haar wavelet transform decomposes it into four sub-components of size H/2 × W/2: one low-frequency component H₁ and three high-frequency components corresponding to horizontal H₂, vertical H₃, and diagonal directions H₄. These sub-components collectively retain comprehensive spatial information about the input feature map.

Figure 3 illustrates the workflow of the HWD module. First, the input feature map undergoes Haar wavelet transform, which decomposes it into four sub-components: one low-frequency approximation and three high-frequency details corresponding to horizontal, vertical, and diagonal directions. These sub-components are then concatenated along the channel dimension, effectively reducing the spatial resolution by half while increasing the number of channels to four times the original count. Subsequently, a 1 × 1 convolution is applied to compress the channel dimension to match the specific requirements of downstream tasks. This design not only reduces the feature map resolution while preserving critical spatial information but also enhances the feature extraction capability of subsequent convolutional layers through the reorganization of information across the channel dimension. The mathematical expression of the process is as follows:

\begin{array}{l} (H_{1}, H_{2}, H_{3}, H_{4}) = HWT (x) \\ x^{'} = Conv (Concat (H_{1}, H_{2}, H_{3}, H_{4})) \end{array}

(1)

The fundamental distinction between the down-sampling method based on Haar wavelet transform and traditional approaches such as strided convolution or pooling lies in its ability to preserve more detailed information during down-sampling. Haar wavelet transform maps spatial information into the channel dimension instead of discarding it outright. This characteristic enables the model to extract richer contextual information from these lossless features, significantly enhancing its ability to capture fine details such as crack edges and textures. Moreover, Haar wavelet transform is computationally efficient, with a low complexity that aligns well with the design requirements of lightweight semantic segmentation networks. Consequently, the introduction of the HWD module not only improves the segmentation performance of the model but also enhances its adaptability to complex details in crack images, making it particularly suitable for challenging real-world applications.
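The HWD steps described above (Haar decomposition, channel concatenation, 1 × 1 convolution) can be sketched in a few lines of NumPy. This is a minimal illustration rather than the authors' implementation: the function names `haar_dwt2` and `hwd`, the orthonormal scaling by 1/2, and the random weights are assumptions, and the naming of the three detail bands follows one common convention that may differ from the paper's.

```python
import numpy as np

def haar_dwt2(x):
    """Single-level 2D Haar transform of a (C, H, W) array with even H and W.

    Returns four (C, H/2, W/2) sub-bands: one approximation and three details.
    """
    a = x[:, 0::2, 0::2]  # top-left pixel of each 2x2 block
    b = x[:, 0::2, 1::2]  # top-right
    c = x[:, 1::2, 0::2]  # bottom-left
    d = x[:, 1::2, 1::2]  # bottom-right
    h1 = (a + b + c + d) / 2.0  # low-frequency approximation
    h2 = (a - b + c - d) / 2.0  # difference between the two columns
    h3 = (a + b - c - d) / 2.0  # difference between the two rows
    h4 = (a - b - c + d) / 2.0  # diagonal difference
    return h1, h2, h3, h4

def hwd(x, weight):
    """Haar wavelet down-sampling: transform, concatenate, 1x1 convolution.

    weight has shape (C_out, 4 * C_in); a 1x1 convolution is a per-pixel
    matrix product over the channel axis.
    """
    bands = np.concatenate(haar_dwt2(x), axis=0)    # (4C, H/2, W/2)
    return np.einsum("oc,chw->ohw", weight, bands)  # (C_out, H/2, W/2)

# Usage with made-up sizes: 8 input channels, 16 output channels.
rng = np.random.default_rng(0)
x = rng.standard_normal((8, 16, 16))
y = hwd(x, rng.standard_normal((16, 32)))
print(y.shape)  # (16, 8, 8)
```

With this scaling the four sub-bands carry exactly the same energy as the input, which is one way to see that the transform itself discards nothing; any information loss can only come from the channel compression that follows.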

3.3. Carafe Up-Sampling

Up-sampling plays a pivotal role in dense prediction tasks such as semantic segmentation. Commonly used up-sampling operators include interpolation methods and transposed convolutions. Interpolation methods rely solely on spatial positions to define the up-sampling kernel, ignoring the semantic information embedded in the feature map. Consequently, they have a limited receptive field, which often causes adjacent target regions to merge or disappear during up-sampling, a problem particularly evident in dense and fragmented crack targets. In contrast, transposed convolutions learn up-sampling kernels through the network, enabling better integration of semantic information. However, these kernels are applied uniformly across all image regions, lacking adaptability to local variations. Additionally, transposed convolutions significantly increase the parameter count and computational complexity, presenting challenges in scenarios where computational efficiency is critical.

To address the unique challenges of pavement crack segmentation tasks, this study adopts the Carafe up-sampling operator as a replacement for traditional up-sampling methods. Carafe is a lightweight and efficient approach that generates content-aware up-sampling kernels, enabling better integration of contextual information. These dynamically generated kernels facilitate adaptive feature reorganization, achieving more effective up-sampling. The computational workflow of Carafe is depicted in Figure 4.

The Carafe operator can be divided into two modules: the up-sampling kernel prediction module and the feature reassembly module.

In the up-sampling kernel prediction module, for a feature map $x \in R^{H \times W \times C}$ to be up-sampled by a factor of $σ$, the channel dimension is first reduced to $c$ using a 1 × 1 convolution. Assuming the size of the up-sampling kernel is $K \times K$, one kernel must be predicted for each output location, so the kernels to be predicted have an overall shape of $σ H \times σ W \times K^{2}$. The compressed feature map is then subjected to another convolution operation, transforming its shape to $H \times W \times σ^{2} K^{2}$. The channel dimension of this feature map is then unfolded into the spatial dimensions and normalized using softmax, resulting in up-sampling kernels of shape $σ H \times σ W \times K^{2}$.

In the feature reassembly module, for each point in the up-sampled output feature map, the corresponding position in the input feature map is located, and the feature sub-map in the $K \times K$ neighborhood of that position is extracted. This sub-map is then combined with the up-sampling kernel of the corresponding output point via a dot product. A 1 × 1 convolution is applied to compress the number of channels to $C^{'}$, producing the output $x^{'}$. It is important to note that the same up-sampling kernel is shared across different channels at the same position.
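The two Carafe modules can be illustrated with a small, loop-based NumPy sketch. It is written for readability, not speed, and rests on several assumptions that are not taken from the paper: both convolutions in the kernel prediction module are modeled as 1 × 1 convolutions with random weights, borders are zero-padded, the final 1 × 1 channel compression is omitted, and the names `pixel_shuffle` and `carafe` are illustrative.

```python
import numpy as np

def pixel_shuffle(t, s):
    """Rearrange a (s*s*k, H, W) array into (k, s*H, s*W)."""
    k = t.shape[0] // (s * s)
    h, w = t.shape[1:]
    t = t.reshape(k, s, s, h, w)    # (k, dy, dx, H, W)
    t = t.transpose(0, 3, 1, 4, 2)  # (k, H, dy, W, dx)
    return t.reshape(k, h * s, w * s)

def carafe(x, w_compress, w_encode, sigma=2, k=5):
    """x: (C, H, W); w_compress: (c, C); w_encode: (sigma^2 * k^2, c)."""
    c_in, h, w = x.shape
    # Kernel prediction: compress channels, then predict sigma^2 * k^2
    # values per input pixel and unfold them into the up-sampled grid.
    z = np.einsum("oc,chw->ohw", w_compress, x)
    logits = np.einsum("oc,chw->ohw", w_encode, z)  # (sigma^2 k^2, H, W)
    logits = pixel_shuffle(logits, sigma)           # (k^2, sigma H, sigma W)
    logits = logits - logits.max(axis=0, keepdims=True)
    kernels = np.exp(logits)
    kernels /= kernels.sum(axis=0, keepdims=True)   # softmax over the k^2 taps

    # Feature reassembly: each output pixel is a weighted sum over the
    # k x k neighborhood of its source pixel, shared across channels.
    r = k // 2
    xp = np.pad(x, ((0, 0), (r, r), (r, r)))        # zero padding at borders
    out = np.zeros((c_in, sigma * h, sigma * w))
    for i in range(sigma * h):
        for j in range(sigma * w):
            si, sj = i // sigma, j // sigma         # source location
            patch = xp[:, si:si + k, sj:sj + k]     # (C, k, k)
            kern = kernels[:, i, j].reshape(1, k, k)
            out[:, i, j] = (patch * kern).sum(axis=(1, 2))
    return out

# Usage with made-up sizes: 3 channels compressed to 4, sigma = 2, K = 5.
rng = np.random.default_rng(0)
x = rng.standard_normal((3, 8, 8))
y = carafe(x, rng.standard_normal((4, 3)), rng.standard_normal((100, 4)))
print(y.shape)  # (3, 16, 16)
```

Because each kernel is softmax-normalized, a constant input is reproduced exactly away from the borders, which is a convenient sanity check for any reimplementation.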

3.4. Dynamic Combined Loss Function

Cross-entropy is a commonly used loss function for classification tasks. In semantic segmentation, the cross-entropy loss function focuses on pixel-wise classification, aiming to increase the confidence of output pixels in their true categories. The cross-entropy loss function can be defined as follows:

\mathrm{CELoss}(y, \hat{y}) = - \sum_{i = 1}^{c} y_{i} \log \hat{y}_{i}

(2)

where $c$ represents the total number of classes, and $y$ and $\hat{y}$ represent the true probability distribution and the probability distribution predicted by the model over the $c$ classes, respectively.

Although the cross-entropy loss function is widely used, it assigns equal weight to all pixels, overlooking the importance of boundary regions. This limitation reduces its ability to focus effectively on the edges of target regions, leading to blurry boundary details in crack segmentation [34,35]. The Dice coefficient evaluates the overlap between the predicted results and the ground truth labels. Letting the predicted binary mask be A and the actual binary mask be B, the Dice coefficient is calculated as follows:

\mathrm{Dice}(A, B) = \frac{2 \times |A \cap B|}{|A| + |B|}

(3)

DiceLoss is based on the Dice coefficient, and its calculation formula is as follows:

\mathrm{DiceLoss} = 1 - \mathrm{Dice}(A, B)

(4)

DiceLoss, based on overlap, can better focus on easily misclassified edge regions. Additionally, it effectively handles class imbalance issues, making it particularly suitable for pavement cracks, which are often minority classes with intricate edge details. However, using DiceLoss as the sole loss function in the early stages of network training can lead to instability in model training. Introducing DiceLoss when the network has achieved a certain level of discriminative ability can guide the model to more stably focus on the edge regions.

FocalLoss is based on an improved version of the cross-entropy loss function. It dynamically adjusts the contribution of samples to the loss function based on the disparity between predicted probabilities and true probabilities, allowing the network to focus on learning from harder-to-classify samples. This enhances the network capacity to recognize difficult pixels (primarily in boundary regions), leading to more precise segmentation of the target area. The calculation formula for FocalLoss is as follows:

\mathrm{FocalLoss}(y, \hat{y}) = - \sum_{i = 1}^{c} α_{i} (1 - \hat{y}_{i})^{γ} y_{i} \log \hat{y}_{i}

(5)

where $α_{i}$ is the weighting factor for class $i$, $γ$ is the adjustment factor, and the meanings of the other symbols remain the same.
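The three loss terms can be written compactly for a two-class (background, crack) probability map. The sketch below is illustrative only: it uses a soft Dice overlap computed on probabilities (twice the overlap divided by the sum of the two mask sizes), averages the pixel-wise losses over the image, and picks `alpha = (0.25, 0.75)` and `gamma = 2.0` purely as placeholder values; none of these choices are stated in the paper.

```python
import numpy as np

def ce_loss(y, y_hat, eps=1e-7):
    """Mean pixel-wise cross-entropy; y and y_hat have shape (classes, H, W)."""
    return float(-(y * np.log(y_hat + eps)).sum(axis=0).mean())

def dice_loss(y, y_hat, eps=1e-7):
    """1 - Dice on the crack channel (index 1), using soft overlaps."""
    a, b = y_hat[1], y[1]
    dice = (2.0 * (a * b).sum() + eps) / (a.sum() + b.sum() + eps)
    return float(1.0 - dice)

def focal_loss(y, y_hat, alpha=(0.25, 0.75), gamma=2.0, eps=1e-7):
    """Cross-entropy down-weighted by (1 - y_hat)^gamma for confident pixels."""
    alpha = np.asarray(alpha, dtype=float).reshape(-1, 1, 1)
    per_pixel = -(alpha * (1.0 - y_hat) ** gamma * y * np.log(y_hat + eps)).sum(axis=0)
    return float(per_pixel.mean())

# Usage: a 4x4 one-hot label with a 2x2 crack region and an uninformative
# prediction that assigns probability 0.5 to both classes everywhere.
y = np.zeros((2, 4, 4))
y[1, 1:3, 1:3] = 1.0
y[0] = 1.0 - y[1]
y_hat = np.full((2, 4, 4), 0.5)
print(ce_loss(y, y_hat), dice_loss(y, y_hat), focal_loss(y, y_hat))
```

In this toy case the crack occupies only a quarter of the image, and the Dice term depends only on the crack channel, which is why it is less sensitive to the dominant background class than the pixel-averaged cross-entropy.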

Given the characteristics of cross-entropy loss, Dice loss, and Focal loss, appropriately assigning weights to these three loss functions for joint optimization often results in better segmentation performance. Cross-entropy loss is suitable for improving overall accuracy, Dice loss focuses on addressing class imbalance, and Focal loss emphasizes hard-to-classify samples, especially in boundary regions. However, fixed weight assignments limit the model’s flexibility at different training stages and make it difficult to find optimal parameters, which may lead to instability during training or suboptimal segmentation results.

To overcome this issue, this paper proposes a dynamic weight adjustment method based on a nonlinear function. As the training progresses, the model’s focus changes at different stages. Specifically, as the number of training epochs increases, we dynamically allocate different weights to the three loss functions to ensure that, in the early stages of training, the model focuses on improving overall accuracy, while in the later stages, it pays more attention to segmenting crack edge details. The weight $β (n)$ at the n-th epoch is given by:

β (n) = β_{\max} - \frac{e^{- 3 \frac{n}{N}} - 1}{e^{- 3} - 1} (β_{\max} - β_{\min}) \in [β_{\min}, β_{\max}]

(6)

where N represents the maximum number of training epochs, and $β_{\max}$ and $β_{\min}$ denote the predefined maximum and minimum weight values, respectively.

For four different preset pairs of $β_{\max}$ and $β_{\min}$, the variation trend of the dynamic weight $β (n)$ is shown in Figure 5. As the number of training epochs increases, $β (n)$ exhibits a nonlinear decay, gradually decreasing from $β_{\max}$ to $β_{\min}$, with the decay rate gradually slowing down.

Thus, the combined loss function for the n-th training epoch is expressed as:

\mathrm{Loss}_{n} = β (n) \times \mathrm{CELoss} + (1 - β (n)) \times (\mathrm{DiceLoss} + \mathrm{FocalLoss})

(7)

The combined loss function signifies that, during the early stages of model training, the cross-entropy loss function plays the primary role, enabling the model to focus on the overall information of cracks and effectively recognize the entire area and approximate contours of the cracks. As training progresses, the weights of DiceLoss and FocalLoss gradually increase, guiding the model to concentrate on the challenging-to-segment edge regions of the cracks. Ultimately, this approach leads to accurate segmentation of both the overall crack and its edge details by the network.
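The weight schedule of Equation (6) and the weighted sum of Equation (7) are simple enough to check numerically. The following sketch uses only the Python standard library; the function names and the example values `beta_max = 0.9`, `beta_min = 0.1`, and the three loss values are assumptions chosen for illustration, not settings reported in the paper.

```python
import math

def beta(n, n_max, beta_max, beta_min):
    """Dynamic cross-entropy weight at epoch n, following Equation (6)."""
    ratio = (math.exp(-3.0 * n / n_max) - 1.0) / (math.exp(-3.0) - 1.0)
    return beta_max - ratio * (beta_max - beta_min)

def combined_loss(ce, dice, focal, beta_n):
    """Weighted sum of the three loss values, following Equation (7)."""
    return beta_n * ce + (1.0 - beta_n) * (dice + focal)

# Usage: print the weight at a few epochs of a 50-epoch schedule.
for epoch in (0, 10, 25, 50):
    b = beta(epoch, 50, beta_max=0.9, beta_min=0.1)
    print(epoch, round(b, 4), round(combined_loss(1.0, 0.5, 0.2, b), 4))
```

The schedule starts exactly at `beta_max`, ends at `beta_min`, and drops fastest in the first epochs, so the Dice and Focal terms gain most of their weight early and are then refined over the remaining epochs.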

4. Results

4.1. Dataset

This study primarily focuses on pavement crack detection, and therefore, we selected two public datasets as the research subjects: the Crack500 dataset for concrete pavement cracks [6] and the AsphaltCrack300 dataset for asphalt pavement cracks [36]. The Crack500 dataset contains 3368 images of road cracks collected under various weather conditions, lighting, and angles, documenting cracks in different forms. The AsphaltCrack300 dataset further expands the variety of crack types, with samples from different road scenes, featuring coarser road material particles and more varied crack forms. Many cracks in the Crack500 dataset are very thin or similar in color to the background, and some images contain noise such as shadows, gravel, and stains, adding complexity to the segmentation task.

In this study, 1896 images from the Crack500 dataset were used for training, 1124 images for validation, and 348 images for final testing, with all images uniformly cropped to a size of 360 × 640 pixels. To explore the generalization ability of the proposed method across different scenarios and crack types, the model trained on Crack500 was directly applied to all samples in the AsphaltCrack300 dataset. This cross-dataset validation allows for the evaluation of the proposed method’s performance on various crack types and road conditions, providing a better understanding of its applicability in real-world scenarios.

4.2. Implementation Details

4.2.1. Experimental Platform

The experiments in this study were conducted on a computer equipped with a GeForce RTX 2060 Super 8 GB GPU and an AMD Ryzen 5 5500 CPU, with the model implemented using the PyTorch 2.0 framework. Due to computational resource constraints, a gradient accumulation strategy was employed, setting the batch size to 8 and updating network parameters after accumulating gradients over 4 batches. This approach achieved an effective batch size equivalent to 32 without increasing memory consumption. To optimize network parameters and avoid local optima, the Adam optimizer was used with an initial learning rate of 0.001, and the maximum number of training epochs was set to 50. An early stopping strategy was applied during training to prevent overfitting.
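The claim that accumulating gradients over 4 batches of 8 samples matches a single batch of 32 can be verified on a toy problem. The sketch below uses a linear model with a mean-squared-error loss in NumPy as a stand-in; it is not the paper's training loop, and the helper name `mse_grad` and all sizes other than 8, 4, and 32 are assumptions.

```python
import numpy as np

def mse_grad(w, x, y):
    """Gradient of 0.5 * mean((x @ w - y)^2) with respect to w."""
    return x.T @ (x @ w - y) / len(x)

rng = np.random.default_rng(0)
x = rng.standard_normal((32, 5))  # 32 samples, 5 features
y = rng.standard_normal(32)
w = rng.standard_normal(5)

# Reference: one gradient computed on the full batch of 32 samples.
full = mse_grad(w, x, y)

# Accumulation: 4 micro-batches of 8, each scaled by 1 / 4 before summing.
steps = 4
accum = np.zeros_like(w)
for xb, yb in zip(np.split(x, steps), np.split(y, steps)):
    accum += mse_grad(w, xb, yb) / steps

print(np.allclose(full, accum))  # True
```

The equivalence relies on the loss being averaged over samples and on each micro-batch gradient being scaled by the number of accumulation steps. Layers whose behavior depends on batch statistics, such as batch normalization, still see only 8 samples at a time, so the match is not exact for such networks.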

4.2.2. Evaluation Metrics

This paper evaluates the model’s segmentation performance using six metrics: Mean Intersection over Union (MIoU), F1-score, Accuracy, model Parameters (Params), Floating Point Operations (FLOPs), and Frames Per Second (FPS). These metrics provide a comprehensive assessment of segmentation accuracy, model complexity, computational efficiency, and real-time performance, offering a balanced evaluation of both effectiveness and efficiency. The calculation formulas for MIoU, F1-score, and Accuracy are as follows:

\mathrm{MIoU} = \frac{1}{N} \sum_{i = 1}^{N} \frac{TP_i}{FN_i + FP_i + TP_i}

(8)

\mathrm{F1\text{-}score} = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}

(9)

\mathrm{Recall} = \frac{TP}{TP + FN}

(10)

\mathrm{Precision} = \frac{TP}{TP + FP}

(11)

\mathrm{Accuracy} = \frac{TP + TN}{TP + FP + TN + FN}

(12)

where TP represents the number of pixels correctly predicted as positive, FP represents the number of pixels incorrectly predicted as positive, FN represents the number of pixels incorrectly predicted as negative, TN represents the number of pixels correctly predicted as negative, and N represents the total number of classes. In Equation (8), the subscript i indicates that the counts are computed for class i.
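As a worked illustration of Equations (8) to (12), the sketch below computes the three accuracy metrics from flat lists of pixel labels. It assumes a two-class setting with label 0 for background and label 1 for crack, and it assumes that Precision, Recall, and F1-score are reported for the crack class; the paper does not state these conventions explicitly, and the function names are hypothetical.

```python
def confusion_counts(pred, target, cls):
    """Count TP, FP, FN, TN for class `cls` over flat label sequences."""
    tp = fp = fn = tn = 0
    for p, t in zip(pred, target):
        if p == cls and t == cls:
            tp += 1
        elif p == cls:
            fp += 1
        elif t == cls:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def segmentation_metrics(pred, target, num_classes=2, positive=1):
    """Return MIoU, F1-score, and Accuracy following Equations (8) to (12)."""
    ious = []
    for c in range(num_classes):
        tp, fp, fn, _ = confusion_counts(pred, target, c)
        denom = tp + fp + fn
        ious.append(tp / denom if denom else 0.0)
    miou = sum(ious) / num_classes

    tp, fp, fn, tn = confusion_counts(pred, target, positive)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    return {"miou": miou, "f1": f1, "accuracy": accuracy}


# Eight synthetic pixels: 0 = background, 1 = crack.
target = [0, 0, 0, 0, 1, 1, 1, 0]
pred = [0, 0, 0, 1, 1, 1, 0, 0]
metrics = segmentation_metrics(pred, target)
```

For this toy example the crack class has an IoU of 2/4 and the background class an IoU of 4/6, so the MIoU is their mean, while the F1-score is 2/3 and the Accuracy is 6/8.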

4.3. Model Comparison Experiments

To validate the effectiveness of the proposed model, we compared it on the Crack500 test set with classic semantic segmentation models such as UNet, HRNet, and DeepLabV3+ [18,19,20,21], as well as lightweight models such as EDANet and ESNet [37,38,39,40]. Additionally, we compared it with the methods from references [8,31,33,41]. The experiments followed the setup described in Section 4.2, with the cross-entropy loss function used during training. The results of the various models on the evaluation metrics are presented in Table 1, and segmentation results on representative samples are shown in Figure 6.

The experimental results demonstrate that the proposed model achieves an MIoU of 79.34% and an F1-score of 76.45%, the highest values among all compared models, ahead of lightweight models such as DABNet (78.89% and 75.49%). In terms of complexity, the proposed model has only 0.23 M parameters and 0.95 G FLOPs, achieving high segmentation accuracy with minimal computational complexity. Additionally, its inference speed reaches 73.49 FPS, which is competitive among the lightweight models and exceeds that of all four larger models in Table 1. Overall, the proposed model achieves an excellent balance between performance and complexity, making it suitable for practical applications.

The visualization results indicate that many models struggle with segmenting fine cracks (see left column 1), with some failing even to detect their presence (e.g., HRNet, DeepLabV3+). In contrast, the proposed model accurately captures the contours and edges of fine cracks. On heavily worn pavements (see left columns 3 and 6), some models have difficulty distinguishing between the background and cracks (e.g., UNet, ContextNet), while the proposed model provides the best overall segmentation of both the crack region and its boundaries. On pavements with stains and stone interference (see left columns 5 and 7), some models are misled by the noise and misclassify it as cracks (e.g., EDANet, MobileNetV3). The proposed model, however, exhibits strong robustness and maintains excellent segmentation performance. In the segmentation of mesh-like cracks (see left column 4), the proposed model performs comparably to networks such as MobileNetV3 and HRNet. Overall, the proposed model demonstrates excellent segmentation capabilities across various scenarios.

Several models trained on the Crack500 dataset were directly tested on all samples of the AsphaltCrack300 dataset for cross-dataset evaluation, with the results presented in Table 2. Due to the change in data distribution, the accuracy of all models decreased compared to their performance on the Crack500 dataset. Nevertheless, the proposed model achieved an MIoU of 69.63% and an F1-score of 63.78%, the best among the lightweight models and higher than those of the larger models as well. With the fewest parameters of all compared models, the proposed model maintained stable segmentation accuracy in a new application scenario, highlighting its strong generalization ability and reliability.

The segmentation results on representative samples from the AsphaltCrack300 dataset are shown in Figure 7. Due to the coarse granularity of the asphalt material, many models struggle to effectively distinguish between background and crack pixels (e.g., DeepLabV3+, DDRNet). In scenarios where the distinction between crack regions and the background is low (see left columns 3 and 6), several models exhibit misclassification (e.g., HRNet, ESNet, ContextNet). Under strong lighting interference (see left column 5), the proposed model achieves the closest results to the ground truth. Overall, the proposed model demonstrates the best segmentation accuracy and generalization ability across datasets.

4.4. Loss Function Comparison Experiment

To validate the effectiveness of the proposed dynamic combined loss function, experiments were conducted using a single loss function, a fixed combined loss function, and various dynamic weight schemes. The model employed in these experiments was the one proposed in this study, and the dataset used was Crack500. The experimental results are presented in Table 3.

From the table, it can be observed that the model trained with the dynamic combined loss function generally outperformed those trained with a single loss function. This indicates that leveraging the advantages of multiple loss functions can lead to better segmentation performance. Moreover, the dynamic weight scheme demonstrated a significant performance improvement over the fixed weight scheme in the final evaluation metrics. The dynamic weighting of CELoss (CL), DiceLoss (DL), and FocalLoss (FL) yielded optimal results when the parameter β was set within the range of [0.2, 0.8].
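To make the idea of a dynamically weighted combination of CL, DL, and FL concrete, the sketch below implements the three standard losses for binary per-pixel probabilities together with one possible weighting schedule. The exact weighting rule is defined in Section 3.4 and is not reproduced here, so both the linear ramp in `beta_schedule` and the mixing rule in `combined_loss` are hypothetical stand-ins, not the authors' formulation; only the range [0.2, 0.8] is taken from the text.

```python
import math


def ce_loss(probs, targets, eps=1e-7):
    """Mean binary cross-entropy over pixels."""
    total = 0.0
    for p, t in zip(probs, targets):
        total += -(t * math.log(p + eps) + (1 - t) * math.log(1 - p + eps))
    return total / len(probs)


def dice_loss(probs, targets, smooth=1.0):
    """Soft Dice loss: 1 minus the smoothed Dice coefficient."""
    inter = sum(p * t for p, t in zip(probs, targets))
    return 1.0 - (2.0 * inter + smooth) / (sum(probs) + sum(targets) + smooth)


def focal_loss(probs, targets, gamma=2.0, eps=1e-7):
    """Mean binary focal loss; gamma down-weights easy pixels."""
    total = 0.0
    for p, t in zip(probs, targets):
        pt = p if t == 1 else 1.0 - p
        total += -((1.0 - pt) ** gamma) * math.log(pt + eps)
    return total / len(probs)


def beta_schedule(epoch, max_epochs, beta_min=0.2, beta_max=0.8):
    """Hypothetical linear ramp of beta from beta_min to beta_max."""
    frac = epoch / max(1, max_epochs - 1)
    return beta_min + (beta_max - beta_min) * frac


def combined_loss(probs, targets, beta):
    """Hypothetical mixing rule: CE weighted by (1 - beta), Dice and Focal by beta."""
    region = 0.5 * (dice_loss(probs, targets) + focal_loss(probs, targets))
    return (1.0 - beta) * ce_loss(probs, targets) + beta * region


targets = [1, 0, 1, 0]
good = [0.9, 0.1, 0.8, 0.2]   # predictions close to the labels
bad = [0.1, 0.9, 0.2, 0.8]    # predictions far from the labels
beta_mid = beta_schedule(epoch=25, max_epochs=50)
loss_good = combined_loss(good, targets, beta_mid)
loss_bad = combined_loss(bad, targets, beta_mid)
```

Whatever the actual schedule, restricting β to an interior range such as [0.2, 0.8] means that no loss term is ever switched off entirely, which is consistent with the narrower and wider ranges in Table 3 performing slightly worse.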

4.5. Ablation Studies

To evaluate the impact of various improvements on model performance, this paper uses the FCN8s model with EfficientNet-B0 as the backbone and progressively introduces enhancements, including pruning, the Carafe up-sampling module, additional skip connections, the HWD module, and a dynamic combined loss function. Furthermore, the Carafe up-sampling module, skip connections, the HWD module, and the dynamic combined loss function are added individually to the pruned model to better assess the contribution of each improvement to the model’s final performance. Ablation experiments are conducted on the Crack500 dataset to verify the effectiveness of these improvements, with the results presented in Table 4.

After the pruning operation, the segmentation performance of the model declined significantly. To address this issue, the Carafe up-sampling module, skip connections, the HWD module, and the dynamic combined loss function were individually incorporated into the pruned model. Each improvement enhanced the segmentation accuracy to a varying extent. Furthermore, by progressively integrating all proposed enhancements, the segmentation performance improved consistently, validating the effectiveness of the proposed method. Ultimately, the model achieved an MIoU of 80.37% and an F1-score of 78.22%, which represent increases of 3.08 and 5.62 percentage points, respectively, compared to the original model, while significantly reducing the parameter count and computational cost. Although the inference speed decreased, the model still satisfies the requirements for real-time road crack segmentation. The reduction in frame rate can be partly attributed to GPUs being more optimized for standard convolutions and other conventional operations, while offering limited optimization support for complex modules like HWD and Carafe up-sampling. Future improvements using tools such as TensorRT and ONNX could further enhance the inference speed.
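The gains and savings quoted above can be re-derived from the first and last rows of Table 4. The short sketch below does this arithmetic; the dictionary keys and helper names are chosen here for illustration only.

```python
# Baseline and final rows of Table 4 (values copied from the table).
baseline = {"miou": 77.29, "f1": 72.60, "params_m": 3.59, "flops_g": 2.14, "fps": 107.35}
final = {"miou": 80.37, "f1": 78.22, "params_m": 0.23, "flops_g": 0.95, "fps": 73.49}


def point_gain(key):
    """Absolute difference in percentage points (for MIoU and F1-score)."""
    return round(final[key] - baseline[key], 2)


def relative_change(key):
    """Relative change in percent with respect to the baseline."""
    return (final[key] - baseline[key]) / baseline[key] * 100.0


miou_gain = point_gain("miou")              # percentage points
f1_gain = point_gain("f1")                  # percentage points
params_change = relative_change("params_m")
flops_change = relative_change("flops_g")
fps_change = relative_change("fps")
```

Under these figures the parameter count falls by roughly 93.6% and the FLOPs by roughly 55.6%, while the frame rate drops by roughly 31.5%.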

Previous studies [22,27] have indicated that networks in crack segmentation tasks are prone to parameter redundancy, which can negatively impact accuracy. To verify the necessity of the pruning strategy, this study conducted ablation experiments by excluding the pruning operation to explore the relationship between segmentation performance and computational efficiency. The experimental results are presented in Table 5.

Experimental results indicate that after incorporating the Carafe up-sampling, the model’s accuracy significantly improved due to the optimization of the decoder, but the inference speed decreased substantially. After adding skip connections, the model’s accuracy slightly declined, primarily because the skip connections introduced redundant information into the decoder, which negatively affected the segmentation performance. With the introduction of the HWD module and the dynamic combined loss function, the model’s accuracy continued to improve overall, and the final segmentation performance reached an MIoU of 80.15% and an F1-score of 77.64%. However, compared to the original model, the inference speed fell to about one-third of its original value, and the segmentation accuracy remained lower than that of the proposed model. The results of the two ablation experiments demonstrate that pruning plays a beneficial role in crack segmentation tasks, helping to achieve a balance between model accuracy and inference speed.

To enhance the interpretability of the experiments, Grad-CAM technology [39] was employed to visualize the original model and the proposed model. Grad-CAM generates a heatmap that reflects the regions of interest in the model by combining the feature maps from the last convolutional layer with the gradient information of the class prediction. This approach facilitates an intuitive understanding of the model’s decision-making process. In the heatmap, warm colors indicate regions with high attention, while cool colors indicate regions with low attention. The visualization results are shown in Figure 8.
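The combination step that Grad-CAM performs can be sketched in a few lines. The example below assumes that the feature maps of the last convolutional layer and the gradients of the class score with respect to them have already been extracted (in practice, from the network via hooks); both are given here as small made-up nested lists, and the function name `grad_cam` is hypothetical.

```python
def grad_cam(activations, gradients):
    """Combine K feature maps (each H x W) into one normalized heatmap.

    Each channel weight is the global average of that channel's gradient map;
    the heatmap is the ReLU of the weighted sum of the feature maps.
    """
    h = len(activations[0])
    w = len(activations[0][0])
    heat = [[0.0] * w for _ in range(h)]
    for fmap, gmap in zip(activations, gradients):
        alpha = sum(sum(row) for row in gmap) / (h * w)  # channel importance
        for i in range(h):
            for j in range(w):
                heat[i][j] += alpha * fmap[i][j]
    heat = [[max(0.0, v) for v in row] for row in heat]  # keep positive evidence
    peak = max(max(row) for row in heat)
    if peak > 0:
        heat = [[v / peak for v in row] for row in heat]  # scale to [0, 1]
    return heat


# Two synthetic 2 x 2 feature maps and their gradients.
activations = [
    [[1.0, 0.0], [0.0, 2.0]],
    [[0.0, 3.0], [1.0, 0.0]],
]
gradients = [
    [[0.5, 0.5], [0.5, 0.5]],      # channel supports the class
    [[-1.0, -1.0], [-1.0, -1.0]],  # channel opposes the class
]
heatmap = grad_cam(activations, gradients)
```

Values near 1 in the result correspond to the warm colors in the heatmap, and values near 0 to the cool colors.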

The heatmap reveals that the original model exhibits low attention to the thinnest parts of fine cracks during recognition, a limitation that could result in fragmented crack detection outcomes. In contrast, the proposed model demonstrates superior performance in this regard, significantly improving the recognition of fine cracks. Moreover, the original model shows noticeably lower attention to crack edges compared to the proposed model. The latter effectively captures spatial details with greater precision, markedly enhancing attention to crack edges and enabling more accurate region segmentation.

4.6. Limitations

Our model demonstrates excellent segmentation performance and application potential on the Crack500 and AsphaltCrack300 datasets, but there are still certain limitations. First, although the analysis of the comparative experimental results in this paper thoroughly discusses specific conditions in the datasets, such as road wear, stain interference, fine cracks, and strong light interference, there are differences between the actual application scenarios and the samples in the datasets (e.g., extreme weather, different paving materials). These differences may lead to a decrease in segmentation accuracy in certain situations, posing challenges for generalization. Future research will focus on expanding the dataset types and improving the robustness and generalization capability of the model.

Second, the current experiments have only been conducted on a computer platform, and the model has not yet been optimized using tools such as TensorRT and ONNX. Future work will focus on deploying the model to mobile platforms or handheld devices to enable intelligent road crack detection.

Finally, the current research is mainly focused on static image crack detection. To further enhance the application capability of the model, future research will explore the combination of Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs), to perform time-series analysis and crack segmentation in more complex road scenarios. This will help improve the accuracy and adaptability of the model in dynamic environments.

5. Conclusions

To tackle the challenges of blurred segmentation edges, loss of detail, and excessive computational costs in pavement crack segmentation, this study presents a lightweight, spatial-information-enhanced, fully convolutional neural network model. First, the EfficientNet architecture was pruned, and skip connections were integrated to reduce network parameters while improving the retention and utilization of spatial information. Second, a wavelet transform down-sampling module and an adaptive feature reassembly up-sampling module were introduced to enhance the network’s capacity for spatial detail perception. Finally, a dynamic combined loss function was employed to train the network, progressively refining its focus on edge information within images.

We validated the proposed model on the Crack500 dataset, where it demonstrated superior performance compared to other mainstream models. The model achieved an MIoU of 80.37%, an F1-score of 78.22%, and a parameter count of only 0.23 M, with an inference speed of 73.49 frames per second. This performance reflects an excellent balance between segmentation accuracy and inference speed, meeting the requirements of road inspection for both precision and efficiency. Additionally, cross-domain testing on the AsphaltCrack300 dataset confirmed the model’s strong generalization capability across diverse road scenarios.

Author Contributions

Conceptualization, Y.Y. and W.X.; Data curation, B.H.; Formal analysis, Y.Y.; Investigation, W.X.; Methodology, Y.Y.; Project administration, B.H.; Resources, B.H.; Software, W.X.; Supervision, Z.Z.; Validation, Z.Z. and B.H.; Visualization, Y.Y.; Writing—original draft, Y.Y.; Writing—review & editing, Z.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data used in this paper were derived from the publicly available Crack500 and AsphaltCrack300 datasets (https://github.com/guoguolord/CrackDataset, accessed on 10 May 2023). The data files supporting the conclusions of this study are available from the corresponding author upon reasonable request.

Acknowledgments

The authors would like to thank each member of the team for their efforts.

Conflicts of Interest

Author Wenjun Xia was employed by the company China Railway Siyuan Survey and Design Group Co., Ltd. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

References

1. Nguyen, S.D.; Tran, T.S.; Tran, V.P.; Lee, H.J.; Piran, M.J.; Le, V.P. Deep Learning-Based Crack Detection: A Survey. Int. J. Pavement Res. Technol. 2023, 16, 943–967.
2. Xing, Y.; Han, X.; Pan, X.; An, D.; Liu, W.; Bai, Y. EMG-YOLO: Road Crack Detection Algorithm for Edge Computing Devices. Front. Neurorob. 2024, 18, 1423738.
3. Alshawabkeh, S.; Wu, L.; Dong, D.; Cheng, Y.; Li, L.; Alanaqreh, M. Automated Pavement Crack Detection Using Deep Feature Selection and Whale Optimization Algorithm. Comput. Mater. Contin. 2023, 77, 63–77.
4. Qiang, S.; Guoying, L.; Jingqi, M.; Hongmei, Z. An Edge-Detection Method Based on Adaptive Canny Algorithm and Iterative Segmentation Threshold. In Proceedings of the 2016 2nd IEEE International Conference on Control Science and Systems Engineering (ICCSSE), Singapore, 27–29 July 2016; pp. 64–67.
5. Tsalera, E.; Papadakis, A.; Samarakou, M.; Voyiatzis, I. Feature Extraction with Handcrafted Methods and Convolutional Neural Networks for Facial Emotion Recognition. Appl. Sci. 2022, 12, 8455.
6. Yang, F.; Zhang, L.; Yu, S.; Prokhorov, D.; Mei, X.; Ling, H. Feature Pyramid and Hierarchical Boosting Network for Pavement Crack Detection. IEEE Trans. Intell. Transp. Syst. 2020, 21, 1525–1535.
7. Di Benedetto, A.; Fiani, M.; Gujski, L.M. U-Net-Based CNN Architecture for Road Crack Segmentation. Infrastructures 2023, 8, 90.
8. Saberironaghi, A.; Ren, J. DepthCrackNet: A Deep Learning Model for Automatic Pavement Crack Detection. J. Imaging 2024, 10, 100.
9. Qi, Y.; Wan, F.; Lei, G.; Liu, W.; Xu, L.; Ye, Z.; Zhou, W. GMDNet: An Irregular Pavement Crack Segmentation Method Based on Multi-Scale Convolutional Attention Aggregation. Electronics 2023, 12, 3348.
10. Tan, M.; Le, Q.V. EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks. In Proceedings of the 36th International Conference on Machine Learning, Long Beach, CA, USA, 10–15 June 2019; PMLR: Breckenridge, CO, USA, 2019.
11. Xu, G.; Liao, W.; Zhang, X.; Li, C.; He, X.; Wu, X. Haar Wavelet Downsampling: A Simple but Effective Downsampling Module for Semantic Segmentation. Pattern Recognit. 2023, 143, 109819.
12. Wang, J.; Chen, K.; Xu, R.; Liu, Z.; Loy, C.C.; Lin, D. CARAFE: Content-Aware ReAssembly of FEatures. In Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision (ICCV 2019), Seoul, Republic of Korea, 27 October–2 November 2019; IEEE Computer Soc.: Los Alamitos, CA, USA, 2019; pp. 3007–3016.
13. Peng, L.; Chao, W.; Shuangmiao, L.; Baocai, F. Research on Crack Detection Method of Airport Runway Based on Twice-Threshold Segmentation. In Proceedings of the 2015 IEEE Fifth International Conference on Instrumentation and Measurement, Computer, Communication and Control (IMCCC), Qinhuangdao, China, 18–20 September 2015; pp. 1716–1720.
14. Matsuoka, T.; Matsushima, K. Crack Detection Using Spectral Clustering Based on Crack Features. In Proceedings of the 2016 IEEE Region 10 Conference (TENCON), Singapore, 22–25 November 2016; IEEE: New York, NY, USA, 2016; pp. 2575–2578.
15. Shi, Y.; Cui, L.; Qi, Z.; Meng, F.; Chen, Z. Automatic Road Crack Detection Using Random Structured Forests. IEEE Trans. Intell. Transp. Syst. 2016, 17, 3434–3445.
16. Kang, D.; Benipal, S.S.; Gopal, D.L.; Cha, Y.-J. Hybrid Pixel-Level Concrete Crack Segmentation and Quantification across Complex Backgrounds Using Deep Learning. Autom. Constr. 2020, 118, 103291.
17. Zou, Q.; Zhang, Z.; Li, Q.; Qi, X.; Wang, Q.; Wang, S. DeepCrack: Learning Hierarchical Convolutional Features for Crack Detection. IEEE Trans. Image Process. 2019, 28, 1498–1512.
18. Ronneberger, O.; Fischer, P.; Brox, T. U-Net: Convolutional Networks for Biomedical Image Segmentation. In Proceedings of the Medical Image Computing and Computer-Assisted Intervention—MICCAI 2015, Munich, Germany, 5–9 October 2015; Navab, N., Hornegger, J., Wells, W.M., Frangi, A.F., Eds.; Lecture Notes in Computer Science. Springer International Publishing: Cham, Switzerland, 2015; Volume 9351, pp. 234–241, ISBN 978-3-319-24573-7.
19. Chen, L.-C.; Zhu, Y.; Papandreou, G.; Schroff, F.; Adam, H. Encoder-Decoder with Atrous Separable Convolution for Semantic Image Segmentation. In Proceedings of the Computer Vision—ECCV 2018, Munich, Germany, 8–14 September 2018; Ferrari, V., Hebert, M., Sminchisescu, C., Weiss, Y., Eds.; Lecture Notes in Computer Science. Springer International Publishing: Cham, Switzerland, 2018; Volume 11211, pp. 833–851, ISBN 978-3-030-01233-5.
20. Sun, K.; Xiao, B.; Liu, D.; Wang, J. Deep High-Resolution Representation Learning for Human Pose Estimation. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; pp. 5686–5696.
21. Hong, Y.; Pan, H.; Sun, W.; Jia, Y. Deep Dual-Resolution Networks for Real-Time and Accurate Semantic Segmentation of Road Scenes. arXiv 2021, arXiv:2101.06085.
22. Poudel, R.P.K.; Liwicki, S.; Cipolla, R. Fast-SCNN: Fast Semantic Segmentation Network. arXiv 2019, arXiv:1902.04502.
23. Tang, S.; Sun, T.; Peng, J.; Chen, G.; Hao, Y.; Lin, M.; Xiao, Z.; You, J.; Liu, Y. PP-MobileSeg: Explore the Fast and Accurate Semantic Segmentation Model on Mobile Devices. arXiv 2023, arXiv:2304.05152.
24. Howard, A.; Sandler, M.; Chen, B.; Wang, W.; Chen, L.-C.; Tan, M.; Chu, G.; Vasudevan, V.; Zhu, Y.; Pang, R.; et al. Searching for MobileNetV3. In Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27 October–2 November 2019; pp. 1314–1324.
25. Qu, Z.; Cao, C.; Liu, L.; Zhou, D.-Y. A Deeply Supervised Convolutional Neural Network for Pavement Crack Detection with Multiscale Feature Fusion. IEEE Trans. Neural Netw. Learn. Syst. 2022, 33, 4890–4899.
26. Fan, Y.; Hu, Z.; Li, Q.; Sun, Y.; Chen, J.; Zhou, Q. CrackNet: A Hybrid Model for Crack Segmentation with Dynamic Loss Function. Sensors 2024, 24, 7134.
27. Chu, H.; Chun, P. Fine-Grained Crack Segmentation for High-Resolution Images via a Multiscale Cascaded Network. Comput.-Aided Civ. Infrastruct. Eng. 2024, 39, 575–594.
28. Chen, B.; Fan, M.; Li, K.; Gao, Y.; Wang, Y.; Chen, Y.; Yin, S.; Sun, J. The PFILSTM Model: A Crack Recognition Method Based on Pyramid Features and Memory Mechanisms. Front. Mater. 2024, 10, 1347176.
29. Hong, Z.; Yang, F.; Pan, H.; Zhou, R.; Zhang, Y.; Han, Y.; Wang, J.; Yang, S.; Chen, P.; Tong, X.; et al. Highway Crack Segmentation from Unmanned Aerial Vehicle Images Using Deep Learning. IEEE Geosci. Remote Sens. Lett. 2022, 19, 6503405.
30. Zhang, T.; Qin, L.; Zou, Q.; Zhang, L.; Wang, R.; Zhang, H. CrackScopeNet: A Lightweight Neural Network for Rapid Crack Detection on Resource-Constrained Drone Platforms. Drones 2024, 8, 417.
31. Al-maqtari, O.; Peng, B.; Al-Huda, Z.; Al-Malahi, A.; Maqtary, N. Lightweight Yet Effective: A Modular Approach to Crack Segmentation. IEEE Trans. Intell. Veh. 2024, 1–12.
32. Wang, X.; Mao, Z.; Liang, Z.; Shen, J. Multi-Scale Semantic Map Distillation for Lightweight Pavement Crack Detection. IEEE Trans. Intell. Transp. Syst. 2024, 25, 15081–15093.
33. Tao, H. Weakly-Supervised Pavement Surface Crack Segmentation Based on Dual Separation and Domain Generalization. IEEE Trans. Intell. Transp. Syst. 2024, 25, 19729–19743.
34. Huang, Z.; Sui, Y. Contour-Weighted Loss for Class-Imbalanced Image Segmentation. In Proceedings of the 2024 IEEE International Conference on Image Processing (ICIP), Abu Dhabi, United Arab Emirates, 27–30 October 2024; pp. 3084–3090.
35. Kervadec, H.; Bouchtiba, J.; Desrosiers, C.; Granger, E.; Dolz, J.; Ben Ayed, I. Boundary Loss for Highly Unbalanced Segmentation. Med. Image Anal. 2021, 67, 101851.
36. Jiang, X.; Jiang, J.; Yu, J.; Wang, J.; Wang, B. MSK-UNET: A Modified U-Net Architecture Based on Selective Kernel with Multi-Scale Input for Pavement Crack Detection. J. Circuits Syst. Comput. 2023, 32, 2350006.
37. Lo, S.-Y.; Hang, H.-M.; Chan, S.-W.; Lin, J.-J. Efficient Dense Modules of Asymmetric Convolution for Real-Time Semantic Segmentation. In Proceedings of the ACM Multimedia Asia, Beijing, China, 15 December 2019; pp. 1–6.
38. Han, W.; Zhang, Z.; Zhang, Y.; Yu, J.; Chiu, C.-C.; Qin, J.; Gulati, A.; Pang, R.; Wu, Y. ContextNet: Improving Convolutional Neural Networks for Automatic Speech Recognition with Global Context. arXiv 2020, arXiv:2005.03191.
39. Li, G.; Yun, I.; Kim, J.; Kim, J. DABNet: Depth-Wise Asymmetric Bottleneck for Real-Time Semantic Segmentation. arXiv 2019, arXiv:1907.11357.
40. Wang, Y.; Zhou, Q.; Wu, X. ESNet: An Efficient Symmetric Network for Real-Time Semantic Segmentation. In Pattern Recognition and Computer Vision, Proceedings of the Second Chinese Conference, PRCV 2019, Xi’an, China, 8–11 November 2019; Springer International Publishing: Cham, Switzerland, 2019.
41. Wang, W.; Su, C. Convolutional Neural Network-Based Pavement Crack Segmentation Using Pyramid Attention Network. IEEE Access 2020, 8, 206548–206558.

Figure 1. Structure of MBConv and SE Channel Attention Mechanism.

Figure 2. Optimized Backbone Network Structure.

Figure 3. Haar Wavelet Down-sampling.

Figure 4. Carafe Up-sampling Module.

Figure 5. Trend of Dynamic Weight Variation.

Figure 6. Segmentation Results of Various Models on Representative Samples from the Crack500 Dataset.

Figure 7. Segmentation Results of Various Models on Representative Samples from the AsphaltCrack300 Dataset.

Figure 8. Heatmaps of the Original Model and the Final Model.

Table 1. Comparison of Evaluation Metrics for Various Network Models on the Crack500 Test Dataset.

Model	MIoU/%	F1-Score/%	Accuracy/%	Params/M	FLOPs/G	FPS
UNet	72.30	68.05	95.36	19.51	157.96	16.84
DeepLabV3+	74.87	71.26	96.41	40.29	61.51	41.68
HRNet	76.58	71.47	97.27	29.53	80.30	21.49
DDRNet	75.49	70.27	97.00	32.36	31.39	70.96
ContextNet	77.49	73.28	97.34	0.87	0.78	135.97
DABNet	78.89	75.49	97.53	0.75	4.64	108.38
EDANet	78.22	74.71	97.37	0.68	3.91	87.09
ESNet	78.65	75.49	97.49	1.66	11.82	72.04
MobileNetV3	78.70	75.38	97.41	2.98	1.16	119.16
LMM [31]	74.81	74.89	——	0.87	6.56	——
Wang et al. [41]	——	76.03	——	——	——	——
DSDGNet [33]	77.67	——	——	2.82	32.13	——
DepthCrackNet [8]	77.00	73.83	96.3	——	——	——
Ours	79.34	76.45	97.50	0.23	0.95	73.49

Table 2. Evaluation Metrics of Various Network Models Tested Directly on the AsphaltCrack300 Dataset.

Model	MIoU/%	F1-Score/%	Accuracy/%	Params/M	FLOPs/G	FPS
UNet	44.53	44.21	67.81	19.51	157.96	16.84
DeepLabV3+	68.90	60.67	94.34	40.29	61.51	41.68
HRNet	52.00	48.40	77.85	29.53	80.30	21.49
DDRNet	62.51	52.72	91.92	32.36	31.39	70.96
ContextNet	66.06	59.37	91.95	0.87	0.78	135.97
DABNet	69.19	62.67	93.78	0.75	4.64	108.38
EDANet	65.03	58.77	91.04	0.68	3.91	87.09
ESNet	66.13	59.71	91.85	1.66	11.82	72.04
MobileNetV3	68.05	60.42	93.57	2.98	1.16	119.16
Ours	69.63	63.78	94.03	0.23	0.95	73.49

Table 3. Results of Loss Function Comparison Experiments.

Loss	Weighting Strategy	MIoU/%	F1-Score/%	Accuracy/%
CL	Fixed Weight	79.34	76.45	97.50
DL	Fixed Weight	79.14	76.03	97.51
FL	Fixed Weight	73.62	68.49	95.63
CL + DL	Fixed Weight	79.91	77.70	97.48
CL + DL + FL	Fixed Weight	79.80	77.37	97.45
CL + DL + FL	Dynamic Weight, $β$ ∈ [0.0,1.0]	79.97	77.86	97.35
CL + DL + FL	Dynamic Weight, $β$ ∈ [0.4,0.6]	79.90	77.62	97.42
CL + DL + FL	Dynamic Weight, $β$ ∈ [0.3,0.7]	80.07	78.00	97.43
CL + DL + FL	Dynamic Weight, $β$ ∈ [0.1,0.9]	80.10	77.78	97.50
CL + DL + FL	Dynamic Weight, $β$ ∈ [0.2,0.8]	80.37	78.22	97.59

Table 4. Ablation Experiment Results.

Method	MIoU/%	F1-Score/%	Accuracy/%	Params/M	FLOPs/G	FPS
Baseline	77.29	72.60	97.28	3.59	2.14	107.35
+Pruning	71.88	65.19	96.49	0.13	0.35	151.54
Only + Carafe Up-sampling	78.66	75.32	97.36	0.18	0.81	81.19
Only + Skip Connection	74.69	69.74	96.74	0.13	0.36	147.40
Only + HWD	76.67	72.70	97.09	0.19	0.48	121.84
Only + Dynamic Combined Loss Function	75.56	71.63	96.55	0.19	0.48	121.84
+Carafe Up-sampling	78.66	75.32	97.36	0.18	0.81	81.19
+Skip Connection	79.14	76.01	97.42	0.18	0.82	80.77
+HWD	79.34	76.45	97.50	0.23	0.95	73.49
+Dynamic Loss Function	80.37	78.22	97.59	0.23	0.95	73.49

Table 5. Ablation Experiment Results After Excluding Pruning.

Method	MIoU/%	F1-Score/%	Accuracy/%	Params/M	FLOPs/G	FPS
Baseline	77.29	72.60	97.28	3.59	2.14	107.35
+Carafe Up-sampling	79.57	76.97	97.50	4.34	5.99	38.16
+Skip Connection	79.45	76.88	97.50	4.34	6.02	38.01
+HWD	79.52	77.18	97.63	6.47	7.45	33.26
+Dynamic Combined Loss Function	80.32	78.01	97.55	6.47	7.45	33.26

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

