Article

A Road Crack Segmentation Method Based on Transformer and Multi-Scale Feature Fusion

1 Faculty of Land Resources Engineering, Kunming University of Science and Technology, Kunming 650093, China
2 Department of Earth Science and Technology, City College, Kunming University of Science and Technology, Kunming 650233, China
3 Yunnan Geological Engineering the Second Investigation Institute, Kunming 650093, China
* Author to whom correspondence should be addressed.
Electronics 2024, 13(12), 2257; https://doi.org/10.3390/electronics13122257
Submission received: 24 May 2024 / Revised: 6 June 2024 / Accepted: 7 June 2024 / Published: 8 June 2024
(This article belongs to the Special Issue Computer Vision for Modern Vehicles)

Abstract

To ensure the safety of vehicle travel, the maintenance of road infrastructure has become increasingly critical, with efficient and accurate detection techniques for road cracks emerging as a key research focus in the industry. The development of deep learning technologies has shown tremendous potential in improving the efficiency of road crack detection. While convolutional neural networks have proven effective in most semantic segmentation tasks, overcoming their limitations in road crack segmentation remains a challenge. To address this, this paper proposes a novel road crack segmentation network that leverages the powerful spatial feature modeling capabilities of Swin Transformer and the Encoder–Decoder architecture of DeepLabv3+. Additionally, the incorporation of a multi-scale coding module and attention mechanism enhances the network’s ability to densely fuse multi-scale features and expand the receptive field, thereby improving the integration of information from feature maps. Performance comparisons with current mainstream semantic segmentation models on crack datasets demonstrate that the proposed model achieves the best results, with an MIoU of 81.06%, Precision of 79.95%, and F1-score of 77.56%. The experimental results further highlight the model’s superior ability in identifying complex and irregular cracks and extracting contours, providing guidance for future applications in this field.

1. Introduction

Cracks represent a primary manifestation of road pavement distress. They not only compromise structural integrity but also give rise to ancillary issues [1]. Autonomous vehicles heavily rely on their ability to accurately interpret and understand the road environment. Failure to detect road cracks can pose significant risks to vehicle operation and passenger safety. Additionally, timely and accurate crack analysis before structural failure is crucial for ensuring the safe operation and sustainable development of transportation infrastructure [2]. Conventional inspection methods reliant on manual labor exhibit significant efficiency bottlenecks, consuming substantial human and material resources and struggling to meet the demands of large-scale, high-frequency road monitoring. Leveraging advanced computer vision and artificial intelligence technologies to achieve automated, high-precision road crack segmentation has emerged as a focal point of interest for researchers in this domain. In recent years, breakthroughs in deep learning algorithms have revolutionized image processing and recognition tasks [3]. Inspired by these advancements, some scholars have introduced deep learning methodologies into the field of crack visual detection, offering a fresh opportunity to supplant traditional image processing techniques and feature-based machine learning methods, thereby addressing the complexity of road crack segmentation [4]. This paradigm shift provides efficient, accurate, and cost-effective solutions for crack detection, propelling the intelligent and efficient development of road maintenance management practices.
FCN [5] is an early classic segmentation algorithm that incorporates multiple features through upsampling and skip connections to predict the entire input image using convolutional layers. Liu et al. [6] combined an extended FCN with a deep supervision network to propose a deep hierarchical convolutional neural network (CNN) architecture named DeepCrack, which segments crack images in an end-to-end manner. Yang et al. [7] fused contextual information into low-level features in a feature pyramid manner and simultaneously adjusted sample weights hierarchically during training to focus the model on hard samples. Liu et al. [8] improved crack pixel-level detection efficiency and accuracy by introducing multiple dilation modules and upsampling modules into the network. The Encoder–Decoder architecture has been widely applied in the pixel-level crack detection domain. Fan et al. [9] introduced an Encoder–Decoder structure with hierarchical feature learning ability and dilated convolutions, which enhances identification accuracy and detection sensitivity by extracting and fusing crack feature maps of different sizes and levels.
Convolutional Neural Networks (CNNs) are a type of deep learning model with core components including convolutional layers, pooling layers, and fully connected layers. By stacking multiple layers, CNNs can automatically learn multi-level features from images, enabling image classification and recognition. However, as the number of layers increases, traditional deep CNNs face issues of vanishing and exploding gradients. To address this problem, ResNet (Residual Network) was introduced. He et al. [10] proposed a residual learning framework, defining the output of each layer as the residual function of the input. Since its introduction, ResNet has been widely applied to various deep learning tasks. The detection algorithm for road cracks requires robustness to extract semantic features of cracks and accurate position prediction, especially for fine cracks. Deep Convolutional Neural Network (DCNN) methods, such as U-Net [11], construct deep feature maps through downsampling (e.g., max pooling). However, for tasks involving small targets or requiring precise boundary prediction, the semantic information decreases as it gradually combines with low-level features, which reduces localization accuracy and renders such methods unsuitable for crack segmentation in complex scenes [12]. DeepLabv3+ [13] is a significant semantic segmentation model in the field of deep learning, proposed by the Google Brain team in 2018, which demonstrates outstanding performance in tasks involving complex scenes and multi-scale objects. Similar to U-Net, the method employed in this paper also uses the Encoder–Decoder architecture of DeepLabv3+. However, it includes a multi-scale coding module that is more suitable for crack segmentation. By incorporating skip connections and upsampling operations in the decoder, the method merges high-level and low-level features, thereby enhancing the handling of crack details and contextual information. Specifically, dilated convolutions are used to extract broader contextual information and expand the receptive field, which aids in better capturing global information when dealing with small targets like cracks while simultaneously reducing the loss of resolution. In contrast, U-Net uses max pooling for downsampling, which can lead to the loss of semantic information for small targets, thereby affecting localization accuracy. Ji et al. [14] proposed a method combining the DeepLabv3+ convolutional neural network with crack quantization algorithms, evaluating important indicators such as crack length, width, and area, showcasing the reliability of the DeepLabv3+ model for automatic crack detection. DeepLabv3+ also innovatively introduces an Encoder–Decoder structure, leveraging its unique architectural design and efficient feature fusion strategy, which has demonstrated extensive potential applications in scenarios such as autonomous driving and medical image analysis. Sun et al. [15] proposed a DMA-Net model, which adds a multi-scale attention mechanism to DeepLabv3+, dynamically allocating weights between high-level feature maps and low-level feature maps, exhibiting good performance in road crack segmentation scenarios. Among these, the Atrous Spatial Pyramid Pooling (ASPP) module captures multi-scale contextual information in images through dilated convolutions with different dilation rates.
However, it uses only a fixed set of dilation rates, resulting in a relatively small receptive field, which limits the effectiveness of capturing image information and restricts the information flow between feature maps. In contrast, the densely connected atrous spatial pyramid pooling (DenseASPP) [16] employed in this study enhances the original ASPP's independent branching mechanism by incorporating the idea of skip connections from DenseNet [17], utilizing a combination of atrous convolutional mechanisms with cross-layer connections. This approach reduces the loss of resolution while extracting more comprehensive feature information. It enhances the fusion and propagation of multi-scale features in different scenarios, optimizing the efficiency of computational resource utilization.
In the field of crack segmentation, CNNs have been widely employed and have made significant strides. However, the shortcomings of CNNs are also prominent. As illustrated in Figure 1, a CNN reduces feature resolution to obtain a larger receptive field. Consequently, as the depth of the convolutional layers increases, the resolution of the convolutional feature maps decreases, resulting in a gradual reduction in spatial details. This makes CNNs relatively limited in handling long-range dependencies and global contextual information. In cases of complex crack morphologies, precise segmentation becomes challenging. In contrast, Transformer networks can retain semantic details and contour structures to a certain extent. Transformer [18], renowned for its unique architecture and attention mechanism, has demonstrated outstanding performance in natural language processing tasks. Inspired by its success, the Vision Transformer (ViT) [19] was introduced to the computer vision domain in 2020. ViT excels in capturing relationships between different regions within images, effectively enhancing efficiency and accuracy in image perception. It has been applied to address tasks such as object detection [20] and segmentation [21,22]. Unlike traditional convolutional neural networks that capture local image features through convolutional operations, ViT performs exceptionally well in tasks involving data sequences of various scales and lengths, indicating that feature extraction does not necessarily require a reduction in spatial resolution layer by layer. Unlike general image segmentation tasks, road crack segmentation models require specific adaptations to address the unique characteristics of cracks. Due to the typically narrow and irregular shapes of cracks, traditional segmentation models may struggle to effectively capture these fine structures. By opting for a transformer architecture, we can significantly enhance the model's ability to recognize and segment cracks with complex shapes, thereby further improving the model's adaptability to complex scenes. Guo et al. [23] proposed a Transformer-based crack segmentation model, achieving finer pixel-level crack detection results. Liu et al. [24] adopted a Transformer network structure similar to the SegNet [25] Encoder–Decoder and introduced novel attention modules capable of extracting contextual information and capturing long-range spatial information. Although ViT employs a global attention mechanism to capture contextual information from images, its architecture lacks hierarchical features compared to the road crack segmentation model proposed in this paper. This limitation affects its performance on tasks sensitive to scale variations, such as crack detection. For segmentation tasks that rely on precise spatial position and size information, a single-scale self-attention mechanism is inadequate for meeting the feature extraction needs across different regions. The proposed road crack segmentation method excels in complex scenes by modeling different scales and local regions through multiple self-attention units, ensuring appropriate attention intensity across various image regions. Wang [26] designed a Transformer-based artificial neural network specifically for identifying concrete cracks. Using a hierarchical Transformer encoder, the network outputs multi-scale features and adopts a top-down approach to gradually upsample through lateral connections, integrating features from the deepest encoder layers.
Swin Transformer [27], designed specifically for computer vision tasks, effectively captures local and global information through its innovative hierarchical design and shifted window mechanism. It is adept at modeling long-range dependencies and capturing contextual information in image data, demonstrating capabilities in handling multi-scale features and high-resolution images. Swin Transformer has surpassed CNNs in image classification, object detection, and image segmentation and has been extensively applied in tasks such as autonomous driving [28,29], medical diagnosis [30,31,32], and defect detection [33], showcasing its immense potential in computer vision. Swin Transformer has also been applied in crack detection. Liu et al. [34] proposed a method that integrates Swin Transformer blocks into a U-Net model to enhance the segmentation performance of concrete cracks. The advantage of Swin Transformers lies in their ability to capture long-range dependencies, thereby improving feature extraction. However, U-Net employs max pooling for downsampling, which leads to the loss of the spatially precise relationship between target pixels and their surrounding pixels. In contrast, the proposed method in this paper leverages the skip-connection mechanism of the DeepLabv3+ architecture to merge high-level and low-level information. Low-level features typically contain more spatial information and details, such as crack edges and contours, while high-level features capture more abstract semantic information. By combining these two types of features through skip connections, semantic information is preserved while details are restored, resulting in more accurate segmentation outcomes. Wang et al. [35] presented the SwinCrack model, which employs Swin Transformer blocks in the encoder, neck, and decoder parts of the model. Skip connections enhance the fusion of feature maps from the encoder to the decoder, improving crack detection performance. Although that model employs Swin Transformer blocks to enhance local feature extraction capabilities, the model proposed in this paper more effectively leverages dedicated multi-scale feature extraction modules to capture crack information at various scales. This functionality provides significant advantages for the segmentation of fine and irregularly shaped cracks. These studies collectively highlight the effectiveness of Swin Transformer in improving crack segmentation by leveraging its advanced feature extraction and long-range dependency modeling capabilities. This paper will further explore the application of Swin Transformers in crack segmentation.
In summary, when dealing with large-scale and highly complex data, it is crucial to ensure both the effective identification of crack edges and accurate segmentation results. Therefore, this study proposes a novel model with the following key contributions:
  • The proposed model employs Swin Transformer as the backbone network, leveraging its efficient hierarchical self-attention design to capture multi-scale information in images. This hierarchical architecture allows flexible modeling across various scales, maintaining computational efficiency while effectively capturing local features to identify fine and continuous cracks.
  • The integration of a DenseASPP structure enhances the learning of crack features at different scales. By merging multi-scale feature maps obtained from the Swin Transformer, the model captures rich contextual information across scales, improving the network’s dense multi-scale feature fusion capability and receptive field.
  • The introduction of a lightweight attention mechanism enhances the model’s ability to locate fine crack features. This mechanism strengthens the integration of information in the output feature maps, aiming to extract more precise boundaries and accurately depict local crack details, thereby improving road crack segmentation performance.
The rest of the paper is organized as follows: Section 2 describes the proposed road crack segmentation method, detailing the overall Encoder–Decoder architecture of the model, the structure and related mechanisms of the Swin Transformer as the backbone network, the characteristics and parameter information of the DenseASPP module, and the principles of the attention mechanism module. Section 3 presents the experimental results and analysis of the proposed method and provides an overview of the datasets used in this study, the training settings, and evaluation metrics. It also compares the proposed method with existing methods using the same dataset and presents ablation study results to demonstrate its superior performance. Finally, Section 4 summarizes our work, discusses its limitations, and outlines future research directions.

2. Materials and Methods

2.1. Network Architecture

This paper presents a Transformer model based on the DeepLabv3+ architecture, as illustrated in Figure 2, for road crack image segmentation.
The network follows the Encoder–Decoder architecture of the DeepLabv3+ model, where we introduce crucial components including the Swin Transformer backbone network, dilated convolutions, DenseASPP [16], and lightweight attention modules.
In the encoder part, the backbone network begins with input images processed through the Patch Partition module for block-wise partitioning. In practice, the Patch Partition and Linear Embedding modules are implemented together as a 4 × 4 convolutional layer. The notation S represents the hierarchical structure of stages in the Swin Transformer. The model architecture comprises four stages, where Stage 1 contains only Swin Transformer blocks. The remaining three stages include Patch Merging and Swin Transformer blocks, with the downsampling rates of the feature maps denoted as {4×, 8×, 16×, 32×}. Cracks often span large areas, and the model captures long-range dependencies beneficial for extracting features of continuous and irregularly directed cracks. Following the backbone network is the DenseASPP module, which utilizes dilated convolution layers with rates of 3, 6, 12, 18, and 24. Each layer's output is connected with all previous layers' outputs, forming a densely connected feature pyramid, and a pooling layer is included to capture multi-scale contextual information. This enhances the learning and fusion of multi-scale features to accommodate the need for detailed expression of cracks of varying sizes in complex scenarios. The attention module, embedded after the DenseASPP module, enhances the integration of information in the output feature maps, aiding the model in accurately locating and segmenting crack boundaries.
In the decoder part, a feature fusion module is introduced using skip connections to integrate high-level abstract features with low-level detailed features, thus improving boundary localization accuracy. The decoder first reduces dimensionality and compresses channels of the feature output by the encoder through a 1 × 1 convolution. The high-level features from the attention mechanism module are upsampled and aligned with the lower-level features from the encoder. These are then fused to achieve an effective combination of shallow and deep features. The fused feature map is processed through convolution layers, and finally, a 1 × 1 convolution layer generates the probability distribution for each pixel’s corresponding class, completing the task of road crack segmentation.
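To make the decoder's fusion step concrete, the following minimal PyTorch sketch shows how low-level encoder features and the upsampled high-level features from the DenseASPP and attention modules can be merged through a 1 × 1 reduction, concatenation, and convolution. The module names, channel sizes, and layer choices are illustrative assumptions rather than the authors' exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrackSegHead(nn.Module):
    """Illustrative decoder sketch: channel widths and layer choices are assumptions."""
    def __init__(self, low_ch, high_ch, num_classes=2):
        super().__init__()
        self.reduce_low = nn.Conv2d(low_ch, 48, kernel_size=1)       # 1x1 conv compresses low-level channels
        self.fuse = nn.Sequential(
            nn.Conv2d(48 + high_ch, 256, kernel_size=3, padding=1),
            nn.BatchNorm2d(256), nn.ReLU(inplace=True),
        )
        self.classifier = nn.Conv2d(256, num_classes, kernel_size=1)  # per-pixel class scores

    def forward(self, low_feat, high_feat, out_size):
        low = self.reduce_low(low_feat)
        # upsample high-level (DenseASPP + attention) features to the low-level resolution
        high = F.interpolate(high_feat, size=low.shape[-2:], mode="bilinear", align_corners=False)
        x = self.fuse(torch.cat([low, high], dim=1))                  # skip-connection fusion
        x = F.interpolate(x, size=out_size, mode="bilinear", align_corners=False)
        return self.classifier(x)
```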

2.2. Application of Swin Transformer in Road Crack Segmentation Scenarios

Road crack segmentation is an important task in road maintenance, as it helps detect surface damage on roads. DeepLabv3+ serves as a powerful semantic segmentation model based on CNN, achieving segmentation tasks through deep feature extraction and techniques like dilated convolutions. However, CNNs have inherent limitations in handling long-range dependencies and global contextual information. Additionally, they suffer from constraints such as limited receptive fields and spatial resolution loss, which can affect the model’s ability to detect complex and subtle road cracks effectively.
Therefore, when addressing the semantic segmentation task for road crack identification, this paper opts to replace the traditional convolutional neural network (CNN) backbone network in DeepLabv3+ with Swin Transformer [27]. Swin Transformer, as an innovator in the field of computer vision, employs self-attention mechanisms. It efficiently captures global and local information in images as well as long-range dependencies, hierarchically modeling information from local to global scales. It has demonstrated excellent performance in various computer vision tasks [20]. Unlike ViT [19], Swin Transformer utilizes hierarchical and shifted-window partitioning strategies to reduce computational complexity and memory consumption while maintaining high-resolution feature representations and understanding the contextual relationships within images. Therefore, by utilizing Swin Transformer as the backbone network, the model in this paper can better adapt to road cracks of different scales and shapes. It enhances the learning ability of features from small and complex crack shapes and improves segmentation boundary accuracy.

2.2.1. The Structure of Swin Transformer

Swin Transformer is a variant of the Transformer model, focusing on capturing spatial information in images. Swin Transformer adopts a hierarchical structure for multi-level feature extraction, as illustrated in Figure 3. This figure is drawn with reference to Figure 3 in [27]. It progressively reduces the spatial resolution of feature maps through multiple stages while increasing the number of channels to enhance expressiveness. Swin Transformer consists of Swin Transformer blocks, which are tailored for segmentation tasks by introducing the hierarchical concept of feature mapping and a shifting window mechanism.
Unlike typical convolutions, the Swin Transformer employs Patch Merging and a series of hierarchical designs, enabling the model to better capture and process both detailed and global feature information of cracks, thus adapting to crack features at different scales. The Patch Merging operation helps increase the receptive field while reducing computational complexity, improving upon traditional convolutional neural networks and making them more suitable for complex crack segmentation tasks. Firstly, the original images are partitioned into patches, with each patch linearly embedded into a high-dimensional vector space, similar to the embedding process of input words in the Vision Transformer (ViT) model. In this way, a crack image of size H × W × C is divided into multiple patches. Feature extraction is performed on each patch, and the resulting feature vectors are concatenated to form a comprehensive representation of the entire crack image. Here, H and W denote the height and width of the image, respectively, and C represents the number of channels. Swin Transformer's patch-based processing enables the model to effectively handle crack objects of various sizes. Subsequently, different-sized feature maps are constructed through four stages to extract clearer crack boundaries, providing higher accuracy results for actual crack detection tasks. In this paper, we choose the Swin-B configuration to enhance the segmentation accuracy of complex and subtle cracks. Except for the initial Linear Embedding layer in Stage 1, the remaining three stages start with a Patch Merging layer for downsampling, followed by the stacking of Swin Transformer Blocks.
Through the Patch Merging layer, adjacent small patches are merged into larger patches, creating feature maps at the next level, akin to the downsampling effect in CNNs. This helps generate high-level abstract features with a larger receptive field, which are crucial for identifying subtle and complex crack boundaries. Consequently, it effectively extracts crack-related features from input images, enabling precise identification and localization of cracks. The architectural design and efficient feature extraction mechanism of Swin Transformer make it significant for road crack detection tasks.
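The Patch Merging operation described above can be illustrated with the following minimal PyTorch sketch, which merges each 2 × 2 group of neighbouring patches, halving the spatial resolution and doubling the channel dimension. The layer layout follows the common Swin Transformer implementation and is shown here only as an assumed reference, not the authors' exact code.

```python
import torch
import torch.nn as nn

class PatchMerging(nn.Module):
    """Minimal Patch Merging sketch: 2x2 neighbouring patches -> half resolution, double channels."""
    def __init__(self, dim):
        super().__init__()
        self.norm = nn.LayerNorm(4 * dim)
        self.reduction = nn.Linear(4 * dim, 2 * dim, bias=False)

    def forward(self, x):                        # x: (B, H, W, C), with H and W assumed even
        x0 = x[:, 0::2, 0::2, :]                 # top-left patch of each 2x2 group
        x1 = x[:, 1::2, 0::2, :]                 # bottom-left
        x2 = x[:, 0::2, 1::2, :]                 # top-right
        x3 = x[:, 1::2, 1::2, :]                 # bottom-right
        x = torch.cat([x0, x1, x2, x3], dim=-1)  # (B, H/2, W/2, 4C)
        return self.reduction(self.norm(x))      # (B, H/2, W/2, 2C)
```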

2.2.2. Swin Transformer Block

Cracks often span large areas, and traditional CNNs, due to the limitations of convolutional kernels, primarily rely on local convolution operations, which restrict their ability to capture long-range dependencies. In contrast, the Swin Transformer, through its self-attention mechanism, can capture long-range dependencies within the input sequence, allowing it to better understand and integrate relationships between distant pixels in an image. This capability enhances precise segmentation in complex image scenarios, offering significant improvements over traditional convolutional methods.
Figure 4 illustrates the schematic diagram connecting two Swin Transformer modules. Within the Swin Transformer module, both window-based multi-head self-attention (W-MSA) and shifted window-based multi-head self-attention (SW-MSA) modules are sequentially employed. A LayerNorm (LN) layer is applied before each (S)W-MSA module and before each MLP, and each MLP consists of two fully connected layers with a GELU non-linearity between them. Swin Transformer blocks are therefore stacked in multiples of two in each stage [36].
The connection of Swin Transformer blocks can be represented by Equation (1), where $\hat{z}^{l}$ denotes the output of the (S)W-MSA module, $z^{l}$ denotes the output of the MLP, and $l$ denotes the position of the Swin Transformer block.
$$
\begin{aligned}
\hat{z}^{l} &= \text{W-MSA}\left(\text{LN}\left(z^{l-1}\right)\right) + z^{l-1} \\
z^{l} &= \text{MLP}\left(\text{LN}\left(\hat{z}^{l}\right)\right) + \hat{z}^{l} \\
\hat{z}^{l+1} &= \text{SW-MSA}\left(\text{LN}\left(z^{l}\right)\right) + z^{l} \\
z^{l+1} &= \text{MLP}\left(\text{LN}\left(\hat{z}^{l+1}\right)\right) + \hat{z}^{l+1}
\end{aligned}
\tag{1}
$$
In the structural design of the Swin Transformer, the W-MSA module works in conjunction with the SW-MSA mechanism. This window-based strategy significantly enhances the effectiveness and efficiency of feature extraction. Unlike the global self-attention mechanism, this method divides the input sequence into a series of local windows, where only the information within each window is interactively computed. While W-MSA focuses solely on local information and cannot extract global features, researchers proposed the SW-MSA method. After executing W-MSA, the SW-MSA’s shifted window mechanism involves moving the partition window in a diagonal direction, facilitating the exchange and learning of attention information between windows. This allows for the capture of overall attentional information in the image. This method not only enhances the spatial modeling capability of the model but also optimizes the overall computational process, making the Swin Transformer more efficient. By introducing this mechanism into segmentation tasks such as crack detection, which require multi-scale information, the model is able to capture a variety of scale information from local to global within the image.
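A compact sketch of a W-MSA/SW-MSA block pair corresponding to Equation (1) is given below. The attention modules are supplied externally (identity placeholders by default), and window partitioning, shifting, masking, and relative position bias, which a full Swin Transformer block requires, are deliberately abstracted away; it is a structural illustration only.

```python
import torch.nn as nn

class SwinBlockPair(nn.Module):
    """Sketch of two consecutive Swin Transformer blocks mirroring Equation (1)."""
    def __init__(self, dim, w_msa=None, sw_msa=None, mlp_ratio=4):
        super().__init__()
        # stand-ins for the window-based and shifted-window attention modules
        self.w_msa = w_msa if w_msa is not None else nn.Identity()
        self.sw_msa = sw_msa if sw_msa is not None else nn.Identity()
        self.norm1, self.norm2 = nn.LayerNorm(dim), nn.LayerNorm(dim)
        self.norm3, self.norm4 = nn.LayerNorm(dim), nn.LayerNorm(dim)
        def mlp():  # two fully connected layers with a GELU non-linearity in between
            return nn.Sequential(nn.Linear(dim, mlp_ratio * dim), nn.GELU(),
                                 nn.Linear(mlp_ratio * dim, dim))
        self.mlp1, self.mlp2 = mlp(), mlp()

    def forward(self, z):                            # z: (B, L, C) token sequence
        z_hat = self.w_msa(self.norm1(z)) + z        # W-MSA with residual connection
        z = self.mlp1(self.norm2(z_hat)) + z_hat
        z_hat = self.sw_msa(self.norm3(z)) + z       # SW-MSA with residual connection
        return self.mlp2(self.norm4(z_hat)) + z_hat
```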

2.3. DenseASPP

Cracks exhibit diverse morphologies, making it crucial to accurately extract crack features at various scales for effective segmentation. Therefore, the detection model must effectively identify cracks of different scales under varying image conditions. Our method achieves this through a network structure with multi-scale receptive fields and a multi-scale feature fusion strategy, enhancing detection accuracy and allowing the model to flexibly adapt to the extraction of spatial features at different scales. In the presence of cracks with varying morphologies, distributions, and complex image backgrounds, the DenseASPP module processes input images in a multi-scale manner, enhancing the fusion of dense multi-scale features and promoting information sharing. This capability improves the segmentation of fine structures like cracks, and by increasing the receptive field while maintaining resolution, it prevents information loss.
ASPP is a parallel method of dilated convolution, but its drawback lies in using fixed dilation rates, which limits the flow of information between feature maps and reduces the efficiency of feature map utilization. Additionally, ASPP suffers from the problem of a small receptive field, making it difficult for the model to capture global information in images. ASPP aggregates information through pooling, but due to the relatively complex background environment of cracks, it is challenging to obtain accurate and complete crack segmentation results. Therefore, this operation is abandoned based on the characteristics of cracks, and the DenseASPP adopted in this paper is a dense connection-based dilated convolution method that can achieve dense connections between feature maps, thereby enhancing feature fusion and propagation.
Furthermore, the feature information extracted by Swin Transformer is combined with the DenseASPP module to further process the high-level feature output by the backbone network, capturing multiscale contextual information. This approach allows for the extraction of features at multiple levels and different receptive field ranges, enhancing the learning and fusion capabilities of multiscale features. Particularly for dense prediction tasks such as crack segmentation, it helps to better handle edge and detail regions, improving the handling of complex boundaries and multiscale targets in crack segmentation tasks.

2.3.1. The Denser Feature Pyramid

The original ASPP structure of DeepLabv3+ consisted of a 1 × 1 convolutional layer, a pooling layer, and three dilated convolutions with dilation rates of 6, 12, and 18, forming the feature pyramid. The DenseASPP used in this paper employs dilated convolutions with dilation rates of 3, 6, 12, 18, and 24. Compared to ASPP, DenseASPP involves more pixels in the computation of the feature pyramid. In DenseASPP, the dilation rate increases layer by layer, allowing upper-level convolutions to utilize lower-level features, resulting in denser pixel sampling. By tightly connecting layers, DenseASPP integrates features across multiple levels with different dilation rates, enabling feature extraction from inputs of various sizes and effectively fusing features at different dilation rates in a dense connection structure, thereby enhancing the model’s understanding and representation of multi-scale information.
The DenseASPP densely connects dilated convolutional layers with different dilation rates across the entire spatial dimension, forming a dense connection structure that enables comprehensive integration of low-level and high-level features. Figure 5a,b illustrate the structural differences between Residual Connection and Cross-layer Connection.
Equations (2) and (3), respectively, represent the output calculations of the Residual Connection structure and the Cross-layer Connection structure. In these equations, $x_n$ denotes the output of the $n$-th layer in the Residual Connection structure, where $F(\cdot)$ is the nonlinear transformation function and $x_{input}$ is the input data of the layer. Similarly, $x_m$ represents the output of the $m$-th layer in the Cross-layer Connection structure, and $[x_0, x_1, \ldots, x_{m-1}]$ is the concatenation of the outputs from all preceding layers, serving as the input to the current layer. This design allows the cross-layer connections to integrate feature representations from various levels within the network, thereby enhancing the overall performance and learning capability of the model.
Although both mechanisms transmit input signals from shallower layers to deeper networks, unlike residual networks that directly connect the current layer to the input layer, cross-layer connections reintroduce shallow features at a deeper level. This structure connects the output of the current layer with the output of each previous layer in the network, thus fully leveraging feature information from each layer.
$$x_n = F_n\left(x_{n-1}\right) + x_{input} \tag{2}$$
$$x_m = F_m\left(\left[x_0, x_1, \ldots, x_{m-1}\right]\right) \tag{3}$$
The Cross-layer Connection mechanism integrates information from different levels through cascading operations. The output from dilated convolutions with smaller dilation rates is combined with the multi-level features generated by the Swin Transformer backbone network. As the dilation rate gradually increases, each layer can receive feature information from different scales and spatial positions.
In one-dimensional dilated convolution operations, with a dilation rate parameter set to 6, the corresponding Effective Receptive Field (ERF) size is 13. Figure 6 is drawn with reference to Figure 3 in [16]. As illustrated in Figure 6a, during the convolution process, only a subregion containing 3 pixels is selected for convolution calculation. This mechanism leads to sparse and lossy image information in the two-dimensional convolution process. To address this issue and effectively expand the network's coverage and processing range of input data, the introduced Cross-layer Connection shares information across layers. Figure 6b shows a one-dimensional dilated convolution layer with a dilation rate of 6, where a convolution layer with a dilation rate of 3 is set in the next layer. This structure increases the number of effective pixels involved in convolution calculation from the original 3 to 7. Further extending to two-dimensional dilated convolutions, this optimization effect becomes more prominent. As depicted in Figure 6c, under a dilation rate of 6 in a two-dimensional dilated convolution layer, setting a convolution layer with a dilation rate of 3 in the lower layer increases the number of effective pixels involved in calculation from the initial 9 to 49, significantly enhancing the network's ability to capture spatial structure and process image features. This contributes to improving the model's performance in handling complex scenes, particularly in cases of blurred boundaries or significant differences in crack sizes, markedly improving the precision and accuracy of road crack segmentation results.
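The dense cross-layer connection pattern of DenseASPP can be summarised in the following PyTorch sketch, in which each dilated branch receives the concatenation of the backbone features and all previous branch outputs. The branch channel width and the BatchNorm/ReLU choices are illustrative assumptions; only the dilation rates (3, 6, 12, 18, 24) follow the paper.

```python
import torch
import torch.nn as nn

class DenseASPP(nn.Module):
    """Minimal DenseASPP sketch with dense cross-layer connections between dilated branches."""
    def __init__(self, in_ch, branch_ch=64, rates=(3, 6, 12, 18, 24)):
        super().__init__()
        self.branches = nn.ModuleList()
        ch = in_ch
        for r in rates:
            self.branches.append(nn.Sequential(
                nn.Conv2d(ch, branch_ch, kernel_size=3, padding=r, dilation=r),
                nn.BatchNorm2d(branch_ch), nn.ReLU(inplace=True)))
            ch += branch_ch                        # dense (cross-layer) connection grows the input width

    def forward(self, x):
        feats = [x]
        for branch in self.branches:
            feats.append(branch(torch.cat(feats, dim=1)))  # each branch sees all previous outputs
        return torch.cat(feats, dim=1)             # fused multi-scale feature stack
```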

2.3.2. Larger Receptive Field

In crack segmentation tasks, the size of cracks can vary significantly. Typical convolutions have a fixed receptive field determined by the kernel size. Compared to increasing network depth or using large convolution kernels to expand the receptive field, the dilated convolutions in the DenseASPP module can increase the dilation rate by inserting gaps into the convolutional kernel. This approach expands the receptive field without reducing the resolution of the input feature map. Additionally, traditional convolutions often use multiple downsampling operations to expand the receptive field, which reduces spatial resolution and leads to the loss of local details. Dilated convolutions can enlarge the receptive field without pooling operations, maintaining a high-resolution description of crack details, and helping the model capture features of cracks with varying sizes.
Through Cross-layer Connection for multi-scale information sharing and connecting convolutional layers with different dilation rates in sequence, a dense feature stack is formed to obtain a larger high-resolution feature receptive field in images. In the ASPP structure, multiple convolutional branches do not share any information. In contrast, DenseASPP adopts the Cross-layer Connection approach, allowing effective transmission and integration of feature information between convolutional layers with different dilation rates. Direct information flow between convolutional layers with different dilation rates constructs a denser feature pyramid and effectively extends the network's receptive field range. Through dilated convolutions, while maintaining the original resolution of the feature map, the receptive field can be significantly increased. For dilated convolutions with a dilation rate of $d$ and a kernel size of $k$, the receptive field size is calculated by Equation (4).
$$R = \left(d - 1\right) \times \left(k - 1\right) + k \tag{4}$$
By stacking two convolutions, a larger receptive field can be obtained, and the size of the stacked receptive field can be calculated by the following Equation (5):
$$R = R_1 + R_2 - 1 \tag{5}$$
$R_1$ and $R_2$, respectively, represent the receptive field sizes of the two stacked dilated convolutional layers. Dilated convolutional layers with different dilation rates exchange information with each other, collectively expanding the overall receptive field range. By employing this DenseASPP structure, the model can more effectively capture contextual information in the image while enhancing feature reuse and deep fusion capabilities, thereby better adapting to cracks of various scales and shapes.
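As a quick numerical check of Equations (4) and (5), the short Python snippet below computes the receptive fields of 3 × 3 dilated convolutions at the dilation rates used in this paper, and the combined receptive field of two stacked layers; it is an illustration only.

```python
def dilated_rf(k, d):
    """Receptive field of a single dilated convolution, Equation (4)."""
    return (d - 1) * (k - 1) + k

def stacked_rf(r1, r2):
    """Receptive field of two stacked layers, Equation (5)."""
    return r1 + r2 - 1

# 3x3 kernels with the DenseASPP dilation rates used in this paper
rates = [3, 6, 12, 18, 24]
print([dilated_rf(3, d) for d in rates])                 # [7, 13, 25, 37, 49]

# stacking the d=3 and d=6 branches already yields a 19-pixel receptive field
print(stacked_rf(dilated_rf(3, 3), dilated_rf(3, 6)))    # 19
```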

2.4. Lightweight Attention Mechanism Module

In this study, an attention mechanism module is incorporated into the model to enable it to focus more on features related to road cracks while suppressing unrelated features. SE-Net [37], a channel-based attention mechanism, is embedded into the model to enhance the integration of information from the output feature maps.
In the crack segmentation task, the morphology of cracks is diverse, and the background is complex. The self-attention mechanism of Swin Transformer focuses on the spatial dimension, capturing both local and global spatial relationships and contextual information. The SE module focuses on channel dimension attention, enhancing important features by weighting different channels. The combination of Swin Transformer and the SE module enhances the model’s ability to process features at different levels, resulting in richer and more comprehensive feature representations that can better handle the diverse morphology of cracks and background variations.
Accurately distinguishing between crack and non-crack regions often requires high sensitivity to subtle features and relies on specific feature channels (e.g., edge information, texture features). Enhancing the network’s edge perception capability is crucial. The SE module can weight the feature channels related to cracks, highlighting crack features and enabling the model to more accurately locate and segment crack boundaries, thereby improving the precision of segmentation results.
The SE-Net structure, as shown in Figure 7, mainly consists of the Squeeze and Excitation parts.
The Squeeze operation compresses and integrates the input feature map in the spatial dimension using global average pooling, thereby generating a global statistical feature vector for each channel. The Excitation operation then takes the globally compressed feature vector and feeds it into a fully connected layer for dimension reduction, producing a low-dimensional feature representation containing inter-channel interaction information. Subsequently, the ReLU activation function is applied, followed by another fully connected layer to restore and expand the feature dimension. Finally, a sigmoid function maps the result to the range between 0 and 1, serving as the weight for each channel. The weighted feature map is obtained by scaling these weights and multiplying them by the original feature map. Through this process, the SE attention mechanism can adaptively adjust the weights of each channel, enhancing relevant features for the target task while suppressing irrelevant features.
SENet increases the weights of important feature channels in this manner, improving the ability of target detection networks to acquire detailed information. The relevant mathematical expressions (6) are as follows:
$$
\begin{aligned}
z_c &= F_{sq}\left(X_c\right) = \frac{1}{H \times W}\sum_{i=1}^{H}\sum_{j=1}^{W} X_c\left(i, j\right) \\
s &= F_{ex}\left(z, W\right) = \sigma\left(g\left(z, W\right)\right) = \sigma\left(W_2\,\delta\left(W_1 z\right)\right) \\
\tilde{X}_c &= F_{scale}\left(X_c, s_c\right) = X_c \cdot s_c
\end{aligned}
\tag{6}
$$
where $H$ and $W$ represent the height and width of the feature map; $z_c$ is the globally compressed feature vector on channel $c$; $X$ is the input feature map; $F_{sq}$ is the Squeeze function; $F_{ex}$ is the Excitation function; $\sigma$ denotes the Sigmoid function; $g$ represents the function performing the two fully connected operations; $\delta$ is the ReLU function; $s$ denotes the obtained weight matrix; $F_{scale}$ is the channel weighting function; and $\tilde{X}$ represents the output feature map.
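A minimal PyTorch sketch of the SE channel-attention block corresponding to Equation (6) is given below; the reduction ratio r = 16 is an assumed hyper-parameter, not a value reported in this paper.

```python
import torch.nn as nn

class SEBlock(nn.Module):
    """SE channel-attention sketch following Equation (6); r is an assumed reduction ratio."""
    def __init__(self, channels, r=16):
        super().__init__()
        self.squeeze = nn.AdaptiveAvgPool2d(1)             # global average pooling, F_sq
        self.excite = nn.Sequential(                       # F_ex: FC -> ReLU -> FC -> Sigmoid
            nn.Linear(channels, channels // r), nn.ReLU(inplace=True),
            nn.Linear(channels // r, channels), nn.Sigmoid())

    def forward(self, x):                                  # x: (B, C, H, W)
        b, c, _, _ = x.shape
        s = self.excite(self.squeeze(x).view(b, c))        # per-channel weights in (0, 1)
        return x * s.view(b, c, 1, 1)                      # F_scale: re-weight each channel
```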
By introducing attention mechanism modules, the model’s focus on important areas and features in the image is enhanced. The weights of each channel can be adaptively adjusted to better capture the details and structure of cracks in the image, effectively utilizing both global and local information in the image and balancing the allocation of attention.

3. Experiments and Results

3.1. Datasets

Datasets serve as the signal sources guiding deep learning models in understanding information. The performance and robustness of models heavily rely on the quantity and quality of training datasets. A dataset with a singular road background may lead to poor generalization of crack identification across different scenarios. Therefore, it is crucial to incorporate a large quantity of real and diverse learning samples. In this study, we amalgamated publicly available datasets featuring various types of road backgrounds to construct our dataset. This approach aims to enhance the practical applicability and significance of actual crack detection tasks in complex environments.
To enhance the generalization capability of the model, this paper combines publicly available datasets, including Crack500 [7], DeepCrack [6], Cracktree200 [38], CFD [39], Volker [40], Rissbilder [40], Eugen Miller [41], Non-crack [42], and GAP [43], to construct the dataset. The images are cropped to a size of 448 × 448, resulting in a total of 10,995 crack images. The established crack image dataset is randomly divided in a 6:2:2 ratio: 6597 images are used for training, 2199 for validation, and 2199 for testing. The weights used for testing are those that yield the highest MIoU on the validation set.
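The 6:2:2 split can be reproduced with a few lines of PyTorch; the snippet below uses an index-only stand-in for the 10,995 cropped images and a fixed seed purely for illustration, since the actual splitting procedure and seed are not specified in the paper.

```python
import torch
from torch.utils.data import TensorDataset, random_split

# Index-only stand-in for the 10,995 cropped 448x448 crack images (illustration only).
dataset = TensorDataset(torch.arange(10995))
n = len(dataset)
n_train, n_val = n * 6 // 10, n * 2 // 10                 # 6:2:2 ratio
train_set, val_set, test_set = random_split(
    dataset, [n_train, n_val, n - n_train - n_val],
    generator=torch.Generator().manual_seed(0))           # assumed seed, not from the paper
print(len(train_set), len(val_set), len(test_set))        # 6597 2199 2199
```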
Crack500 is a large-scale dataset for road surface crack detection, collected at the main campus of Temple University. It contains 500 high-resolution images of road cracks with annotations, primarily featuring asphalt backgrounds, occluded crack positions, and multiple noise sources. CrackTree200 comprises 206 images with various crack patterns and scenes, including shadows and low-contrast backgrounds. The CFD dataset consists of 118 crack images, including high-resolution images with pixel-level crack annotations from real-world scenes, featuring shadows and different types of noise. DeepCrack contains 537 images designed specifically for deep learning crack detection, mainly featuring concrete surfaces with multiple shooting angles, occlusions, and noise points, providing high-resolution images with precise annotations. The Rissbilder and Volker datasets include various crack forms and types, with complex concrete road backgrounds and extensive noise, and are used for evaluating crack detection algorithms. Eugen Miller includes various road types and crack scenes. Non-crack comprises various non-crack scenes used in training to prevent the model from overfitting to crack features. GAP is a German dataset focused on detecting defects in asphalt pavements, featuring dark shadows and low-contrast images. Figure 8 illustrates several crack images along with their corresponding annotations.

3.2. Experimental Environment and Parameter Settings

The experiments were conducted on a system running Ubuntu 20.04 equipped with an NVIDIA Tesla V100 GPU. The deep learning framework PyTorch-GPU 1.10.1 [44] and CUDA version 11.1 were utilized for model training. During the training phase, the AdamW optimizer was employed with an initial learning rate of 6 × 10−5. The weight decay was set to 0.01, and the first and second moment estimates were set to 0.9 and 0.999, respectively. The batch size was set to 4, and the training mode was based on the number of iterations using the IterBased approach. The network was trained for a total of 160 k iterations. Under this configuration, the total training time for the proposed model is 29.1 h. Over 10 iterations, the average time to process a batch of data (including data loading and model forward inference) is 0.6580 s. The weights of the model with the highest accuracy metric were selected for testing. Both the CNN-based and Transformer-based backbone models were initialized using pre-trained weights to expedite convergence during training. The CrossEntropyLoss function was used as the loss function.
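The reported optimizer and loss settings translate directly into the following PyTorch configuration sketch; the model here is a trivial stand-in, and the iteration-based schedule and any learning-rate decay policy are omitted because they are not fully specified above.

```python
import torch

# Hedged sketch of the reported training configuration; `model` is a stand-in module.
model = torch.nn.Conv2d(3, 2, 1)                   # placeholder for the proposed network
optimizer = torch.optim.AdamW(model.parameters(),
                              lr=6e-5,             # initial learning rate
                              betas=(0.9, 0.999),  # first/second moment estimates
                              weight_decay=0.01)   # weight decay
criterion = torch.nn.CrossEntropyLoss()            # loss function used in the paper
# batch size 4, iteration-based training for 160k iterations (scheduler details not specified)
```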

3.3. Model Evaluation Metrics

When evaluating the performance of semantic segmentation, various metrics are employed to measure the model’s performance. The metrics applied in this paper include commonly used ones such as Intersection over Union (IoU), Mean Intersection over Union (MIoU), Precision, Recall, and F1 score, which comprehensively assess various crack segmentation models, and their calculations are based on the confusion matrix.
Intersection over Union (IoU) is frequently used as a performance measure in semantic segmentation and object detection. It calculates the ratio of the intersection to the union between the predicted pixel region of the crack and the actual pixel region of the crack. A higher IoU value indicates a higher degree of overlap between the predicted result and the ground truth, reflecting better model performance. Equation (7) is as follows:
$$IoU = \frac{TP}{TP + FP + FN} \tag{7}$$
Mean Intersection over Union (MIoU) is the average of all class IoU values, providing a comprehensive evaluation of the model’s segmentation performance across all classes. A higher MIoU value indicates better segmentation accuracy across all classes, leading to overall superior performance. Equation (8) is as follows:
$$MIoU = \frac{1}{C}\sum_{c=1}^{C}\frac{TP_c}{TP_c + FP_c + FN_c} \tag{8}$$
In this equation, TP is referred to as True Positive, FN represents False Negative, FP denotes False Positive, and TN is termed True Negative. $TP_c$, $FP_c$, and $FN_c$, respectively, denote the quantities of True Positives, False Positives, and False Negatives for the $c$-th class, where $C$ indicates the number of classes.
Recall and Precision are commonly utilized metrics within the domain of deep learning. Recall measures the proportion of true positive instances among all positive samples, thereby quantifying the comprehensiveness of model predictions. It is defined by Equation (9) as follows:
$$Recall = \frac{TP}{TP + FN} \tag{9}$$
Precision measures the proportion of regions predicted by the model to be cracks (i.e., the positive class) that are truly cracks; larger values indicate that a greater share of the model's positive predictions are correct. Equation (10) is as follows:
$$Precision = \frac{TP}{TP + FP} \tag{10}$$
The F1 score is a metric widely used in statistics as an alternative measure of accuracy for binary classification models. It combines recall and precision in a weighted harmonic mean; the larger the value, the more accurate the segmentation result. Equation (11) is given as follows:
$$F1 = \frac{2 \times Recall \times Precision}{Recall + Precision} \tag{11}$$
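The following NumPy sketch computes the above metrics for the crack class from a confusion matrix, matching Equations (7) to (11); treating class index 1 as the crack class is an assumption made only for illustration.

```python
import numpy as np

def crack_metrics(pred, target, num_classes=2):
    """Compute IoU, MIoU, Precision, Recall, and F1 from a confusion matrix (Equations (7)-(11))."""
    cm = np.zeros((num_classes, num_classes), dtype=np.int64)
    for t, p in zip(target.ravel(), pred.ravel()):
        cm[t, p] += 1                                   # rows: ground truth, columns: prediction
    tp = np.diag(cm)
    fp = cm.sum(axis=0) - tp
    fn = cm.sum(axis=1) - tp
    iou = tp / np.maximum(tp + fp + fn, 1)              # per-class IoU, Equation (7)
    miou = iou.mean()                                   # Equation (8)
    precision = tp / np.maximum(tp + fp, 1)             # Equation (10)
    recall = tp / np.maximum(tp + fn, 1)                # Equation (9)
    f1 = 2 * precision * recall / np.maximum(precision + recall, 1e-12)  # Equation (11)
    # index 1 is assumed to be the crack class
    return iou[1], miou, precision[1], recall[1], f1[1]
```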

3.4. Performance Comparison Experiments

To validate the model’s performance, this paper conducts comparative experiments involving multiple models, with results presented in Table 1. Evaluation metrics chosen for crack segmentation include Intersection over Union (IoU), Mean IoU (MIoU), Precision, Recall, and F1-score. FCN [5], U-Net [11], PSPNet [45], and DeepLabv3 [46] are all renowned semantic segmentation algorithms. Additionally, given the Transformer-based architecture of the proposed model, SegNeXt [47] and SegFormer [22], both based on Transformer architecture, are introduced for comparison. Furthermore, we included comparisons with K-Net [48], which is also based on the Swin Transformer architecture. Additionally, our model was compared with the representative crack segmentation algorithm DeepCrack [6].
Clearly, the model proposed in this study demonstrates superior performance in crack segmentation tasks, achieving an IoU of 63.34%, MIoU of 81.06%, Precision of 79.95%, Recall of 75.31%, and F1-score of 77.56%. On one hand, our approach surpasses CNN-based models, exhibiting improvements over the classical DeepLabv3 model by 2.01%, 4.38%, 1.36%, and 2.99% in MIoU, Recall, Precision, and F1-score metrics, respectively. Compared to U-Net, our model achieves a 7% increase in MIoU and a 9.95% improvement in F1-score. On the other hand, compared to the Transformer-based SegNeXt model, our model demonstrates enhancements of 1.73%, 0.89%, 1.9%, 0.63%, and 1.31% in IoU, MIoU, Recall, Precision, and F1-score, respectively. When compared to SegFormer, our model surpasses it by 2.91% in IoU, 1.5% in MIoU, 3.41% in Recall, and 0.84% in Precision, and achieves a 2.23% improvement in F1-score. Compared to the K-Net model, which also employs the Swin Transformer architecture, our model achieved a 1.17% higher MIoU and a 1.73% higher F1-score. In comparison to the representative crack segmentation algorithm DeepCrack, our model demonstrated an improvement of 2.38% in MIoU and 3.56% in F1-score. It is worth noting that all models utilizing the Transformer architecture outperform CNN-based models overall, thus validating the superiority of models employing the Transformer architecture. These results collectively indicate the superior performance of the proposed model in crack segmentation tasks.
To facilitate a direct comparison and analysis of the segmentation results of each model, we visualize the prediction outcomes of each model in this study, as depicted in Figure 9.
The results are clear: for images with complex crack structures, such as mesh-like patterns, our model produces more continuous and finely detailed crack segmentation results. It can be observed that the crack boundaries are more distinct, and the crack structures are more intact. Compared to SegNeXt and SegFormer, both based on the Transformer architecture, our model achieves superior performance in segmenting fine-grained and irregularly shaped cracks while ensuring the continuity of crack edges. It is capable of predicting various forms of cracks while extracting more detailed contours and clearer boundaries, showcasing its adaptability and generalization ability. In contrast, for larger crack shapes, the differences in prediction results among the models are negligible. However, BiSeNet V2 exhibits extensive missed detections for complex crack structures, while PSPNet and SegNeXt display noticeable segmentation discontinuities and incompleteness for smaller cracks, leading to partial missed detections. The U-Net model exhibits a lack of clarity in identifying crack edges, primarily because larger-sized patches must undergo multiple max-pooling operations. This process results in the loss of precise spatial relationships between target pixels and their surroundings. While the utilization of smaller patches enables the capture of local image information, the inclusion of background context may be insufficient. Such operations are not conducive to the fine segmentation of cracks against complex backgrounds. Furthermore, misclassifications occur in the U-Net model when there are occlusions by objects in the road background, leading to erroneous category differentiations.
This indicates that our model achieves the best crack segmentation results, effectively handling irregular cracks and generating finer predictions. Furthermore, it can segment cracks more accurately and comprehensively, surpassing other segmentation methods overall.

3.5. Ablation Study

To further validate the effectiveness of the proposed improvements, ablation experiments were conducted on the crack dataset, using the DeepLabv3+ model as the baseline (Group 1). The validation results of the models are shown in Table 2. Four sets of experiments were conducted under identical conditions, including models utilizing Swin Transformer as the backbone network, models incorporating lightweight attention mechanisms in addition to Swin Transformer, and models introducing DenseASPP on top of Swin Transformer and lightweight attention mechanisms. Through these improvements, the model achieved a 1.45% increase in IoU, a 2.01% increase in Recall, and a 1.55% increase in F1-score. Compared to CNN, employing Swin Transformer for multi-scale feature extraction resulted in a 1.78% increase in IoU. These improvements demonstrate the effectiveness of utilizing Swin Transformer as the backbone network. Furthermore, with the introduction of lightweight attention mechanisms and DenseASPP in experiments Group 3 and Group 4, the network’s feature extraction capabilities and information integration capabilities were enhanced, allowing the model to focus on more crack features. The network model achieved an IoU of 63.34%, Recall of 75.31%, and F1-score of 77.56% for crack segmentation, effectively improving the model’s performance in crack segmentation tasks and achieving higher segmentation accuracy. In summary, the network refined through three rounds of improvements exhibited optimal performance. The experiments confirmed the capability of the proposed enhancements in enhancing the effectiveness of road crack segmentation.
To validate the improvement in crack extraction achieved by the model, various forms of cracks were selected for visual analysis in this study, as depicted in Figure 10. It can be observed that for structurally complex mesh cracks, segmentation models using CNN as the backbone exhibit partial segmentation omissions. Conversely, in the model proposed in this research, the utilization of a Swin Transformer with a hierarchical window-based self-attention mechanism effectively enhances segmentation boundary accuracy. From the images, we can discern that the improved model excels in modeling long-distance dependencies and capturing contextual information, effectively capturing details such as crack structures in the images. In Group 3 and Group 4, by introducing SE modules and the dense multi-scale feature fusion capability of DenseASPP, the model further enhances its understanding of objects at various scales in the image and its perception of crack edges. It extracts and integrates features from multiple levels and different receptive field ranges, particularly demonstrating effective processing of edges and detail areas in complex and fine-grained crack segmentation tasks, leading to higher accuracy and robustness. In summary, for crack images of different orientations, sizes, and thicknesses, the improved model can produce more complete and continuous segmentation results.

4. Conclusions

The road crack segmentation model proposed in this paper leverages advanced computer vision techniques to provide pixel-level understanding of pavement cracks. This enables the model to make informed decisions and adaptive responses, aiding in the identification of potential hazards that could lead to accidents, thereby enhancing the overall safety of autonomous vehicles. By refining and optimizing the DeepLabv3+ algorithm, replacing the backbone network with Swin Transformer, and introducing DenseASPP and attention mechanisms, the accuracy of the network model in identifying cracks has been effectively improved.
Introducing Swin Transformer as the backbone network in the road crack identification scenario enhances the model’s ability to learn fine and complex crack features, thereby improving segmentation boundary accuracy. Leveraging the hierarchical feature extraction architecture of Swin Transformer improves the model’s understanding and utilization of global context, as well as its expression and localization capabilities for subtle crack features. By employing the Cross-layer Connection between different dilated convolutions in the original ASPP structure, subsequent layers utilize feature information of different scales from preceding layers, enhancing the network’s feature extraction capabilities for objects at various scales. The introduction of attention mechanisms in the network enables the model to focus more on features relevant to road cracks. By adaptively adjusting the weights of each channel, the inter-channel dependencies of feature maps are strengthened, further improving overall segmentation performance.
Comparative experiments demonstrate that the models built on the Transformer architecture outperform the CNN-based models on the overall metrics, and the improved network yields highly accurate segmentation results. On the crack dataset, our model achieves an IoU of 63.34% and an F1-score of 77.56% for crack segmentation. In the visualized test results, it shows higher precision and robustness on complex and irregular road crack segmentation tasks, highlighting its superiority in the accurate detection and extraction of road cracks.
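As a note on the reported overall metrics, the MIoU in Table 1 is consistent with averaging the per-class IoUs over the background and crack classes (for our model this would imply a background IoU of roughly 98.8%, since (98.8% + 63.34%) / 2 ≈ 81.06%). A minimal sketch of that computation, with illustrative names only:

```python
import numpy as np

def mean_iou(pred: np.ndarray, gt: np.ndarray, num_classes: int = 2, eps: float = 1e-8):
    """Mean IoU over all classes (class 0 = background, class 1 = crack)."""
    ious = []
    for c in range(num_classes):
        p, g = pred == c, gt == c
        inter = np.logical_and(p, g).sum()
        union = np.logical_or(p, g).sum()
        ious.append(inter / (union + eps))
    return float(np.mean(ious)), ious
```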
Although the proposed crack segmentation model demonstrates outstanding accuracy, some limitations remain. Its demand for computational resources is relatively high, which poses challenges for real-time processing on resource-constrained devices. Future work will explore more efficient network architectures that maintain performance while reducing computational requirements, enabling real-time processing on embedded devices. In addition, techniques such as semi-supervised learning can reduce the dependence on large amounts of labeled data, lowering the cost and time of model deployment. These improvements are expected to further enhance the practical value of the crack segmentation model, providing reference solutions for crack detection and more reliable technical support for infrastructure maintenance and safety inspection.

Author Contributions

Conceptualization, Y.X. (Yang Xu); methodology, Y.X. (Yang Xu) and Y.X. (Yonghua Xia); software, Y.X. (Yang Xu); validation, Y.X. (Yang Xu), Y.X. (Yonghua Xia) and Q.Z.; formal analysis, Y.X. (Yonghua Xia), Q.Z. and K.Y.; resources, Y.X. (Yonghua Xia), K.Y. and Q.L.; data curation, Y.X. (Yang Xu) and Q.L.; writing—original draft preparation, Y.X. (Yang Xu); writing—review and editing, Y.X. (Yonghua Xia), Q.Z. and K.Y.; visualization, Y.X. (Yang Xu); supervision, Y.X. (Yonghua Xia), Q.Z., K.Y. and Q.L.; project administration, Y.X. (Yang Xu). All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The original contributions presented in the study are included in the article; further inquiries can be directed to the corresponding author.

Acknowledgments

We thank the anonymous reviewers and members of the editorial team for their comments and contributions.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Gavilán, M.; Balcones, D.; Marcos, O.; Llorca, D.F.; Sotelo, M.A.; Parra, I.; Ocaña, M.; Aliseda, P.; Yarza, P.; Amírola, A. Adaptive Road Crack Detection System by Pavement Classification. Sensors 2011, 11, 9628–9657. [Google Scholar] [CrossRef]
  2. Hu, W.B.; Wang, W.D.; Ai, C.B.; Wang, J.; Wang, W.J.; Meng, X.F.; Liu, J.; Tao, H.W.; Qiu, S. Machine vision-based surface crack analysis for transportation infrastructure. Autom. Constr. 2021, 132, 103973. [Google Scholar] [CrossRef]
  3. Alzubaidi, L.; Zhang, J.L.; Humaidi, A.J.; Al-Dujaili, A.; Duan, Y.; Al-Shamma, O.; Santamaría, J.; Fadhel, M.A.; Al-Amidie, M.; Farhan, L. Review of deep learning: Concepts, CNN architectures, challenges, applications, future directions. J. Big Data 2021, 8, 53. [Google Scholar] [CrossRef]
  4. Hamishebahar, Y.; Guan, H.; So, S.; Jo, J. A Comprehensive Review of Deep Learning-Based Crack Detection Approaches. Appl. Sci. 2022, 12, 1374. [Google Scholar] [CrossRef]
  5. Long, J.; Shelhamer, E.; Darrell, T. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA, 7–12 June 2015; IEEE: Piscataway, NJ, USA, 2015; pp. 3431–3440. [Google Scholar]
  6. Liu, Y.H.; Yao, J.; Lu, X.H.; Xie, R.P.; Li, L. DeepCrack: A deep hierarchical feature learning architecture for crack segmentation. Neurocomputing 2019, 338, 139–153. [Google Scholar] [CrossRef]
  7. Yang, F.; Zhang, L.; Yu, S.J.; Prokhorov, D.; Mei, X.; Ling, H.B. Feature Pyramid and Hierarchical Boosting Network for Pavement Crack Detection. IEEE Trans. Intell. Transp. Syst. 2020, 21, 1525–1535. [Google Scholar] [CrossRef]
  8. Wenjun, L.; Yuchun, H.; Ying, L.; Qi, C. FPCNet: Fast pavement crack detection network based on encoder-decoder architecture. arXiv 2019, arXiv:1907.02248. [Google Scholar]
  9. Fan, Z.; Li, C.; Chen, Y.; Wei, J.H.; Loprencipe, G.; Chen, X.P.; Di Mascio, P. Automatic Crack Detection on Road Pavements Using Encoder-Decoder Architecture. Materials 2020, 13, 2960. [Google Scholar] [CrossRef]
  10. He, K.M.; Zhang, X.Y.; Ren, S.Q.; Sun, J. Deep residual learning for image recognition. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; IEEE: Piscataway, NJ, USA, 2016; pp. 770–778. [Google Scholar]
  11. Ronneberger, O.; Fischer, P.; Brox, T. U-Net: Convolutional networks for biomedical image segmentation. In Proceedings of the 18th International Conference on Medical Image Computing and Computer-Assisted Intervention (MICCAI), Munich, Germany, 5–9 October 2015; Springer: Berlin/Heidelberg, Germany, 2015; pp. 234–241. [Google Scholar]
  12. Qu, Z.; Cao, C.; Liu, L.; Zhou, D.Y. A Deeply Supervised Convolutional Neural Network for Pavement Crack Detection with Multiscale Feature Fusion. IEEE Trans. Neural Netw. Learn. Syst. 2022, 33, 4890–4899. [Google Scholar] [CrossRef]
  13. Chen, L.C.E.; Zhu, Y.K.; Papandreou, G.; Schroff, F.; Adam, H. Encoder-decoder with atrous separable convolution for semantic image segmentation. In Proceedings of the 15th European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; Springer: Berlin/Heidelberg, Germany, 2018; pp. 833–851. [Google Scholar]
  14. Ji, A.K.; Xue, X.L.; Wang, Y.N.; Luo, X.W.; Xue, W.R. An integrated approach to automatic pixel-level crack detection and quantification of asphalt pavement. Autom. Constr. 2020, 114, 103176. [Google Scholar] [CrossRef]
  15. Sun, X.Z.; Xie, Y.C.; Jiang, L.M.; Cao, Y.; Liu, B.Y. DMA-Net: DeepLab with Multi-Scale Attention for Pavement Crack Segmentation. IEEE Trans. Intell. Transp. Syst. 2022, 23, 18392–18403. [Google Scholar] [CrossRef]
  16. Yang, M.K.; Yu, K.; Zhang, C.; Li, Z.W.; Yang, K.Y. DenseASPP for semantic segmentation in street scenes. In Proceedings of the 31st IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–23 June 2018; IEEE: Piscataway, NJ, USA, 2018; pp. 3684–3692. [Google Scholar]
  17. Huang, G.; Liu, Z.; van der Maaten, L.; Weinberger, K.Q. Densely connected convolutional networks. In Proceedings of the 30th IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; IEEE: Piscataway, NJ, USA, 2017; pp. 2261–2269. [Google Scholar]
  18. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.; Polosukhin, I. Attention is all you need. In Proceedings of the 31st Annual Conference on Neural Information Processing Systems (NIPS), Long Beach, CA, USA, 4–9 December 2017. [Google Scholar]
  19. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Xiaohua, Z.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. arXiv 2020, arXiv:2010.11929. [Google Scholar]
  20. Carion, N.; Massa, F.; Synnaeve, G.; Usunier, N.; Kirillov, A.; Zagoruyko, S. End-to-end object detection with transformers. In Proceedings of the Computer Vision—ECCV 2020: 16th European Conference, Glasgow, UK, 23–28 August 2020; Lecture Notes in Computer Science (LNCS 12346). pp. 213–229. [Google Scholar]
  21. Zheng, S.X.; Lu, J.C.; Zhao, H.S.; Zhu, X.T.; Luo, Z.K.; Wang, Y.B.; Fu, Y.W.; Feng, J.F.; Xiang, T.; Torr, P.H.S.; et al. Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Virtual, 19–25 June 2021; pp. 6877–6886. [Google Scholar]
  22. Xie, E.Z.; Wang, W.H.; Yu, Z.D.; Anandkumar, A.; Alvarez, J.M.; Luo, P. SegFormer: Simple and efficient design for semantic segmentation with transformers. In Proceedings of the 35th Conference on Neural Information Processing Systems (NeurIPS), Virtual, 6–14 December 2021. [Google Scholar]
  23. Jing-Ming, G.; Markoni, H. Transformer based refinement network for accurate crack detection. In Proceedings of the 2021 International Conference on System Science and Engineering (ICSSE), Ho Chi Minh, Vietnam, 26–28 August 2021; pp. 442–446. [Google Scholar]
  24. Liu, H.J.; Miao, X.Y.; Mertz, C.; Xu, C.Z.; Kong, H. CrackFormer: Transformer network for fine-grained crack detection. In Proceedings of the 18th IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, BC, Canada, 11–17 October 2021; pp. 3763–3772. [Google Scholar]
  25. Badrinarayanan, V.; Kendall, A.; Cipolla, R. SegNet: A Deep Convolutional Encoder-Decoder Architecture for Image Segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 2481–2495. [Google Scholar] [CrossRef]
  26. Wang, W.J.; Su, C. Automatic concrete crack segmentation model based on transformer. Autom. Constr. 2022, 139, 104275. [Google Scholar] [CrossRef]
  27. Liu, Z.; Lin, Y.T.; Cao, Y.; Hu, H.; Wei, Y.X.; Zhang, Z.; Lin, S.; Guo, B.N. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the 18th IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, BC, Canada, 11–17 October 2021; pp. 9992–10002. [Google Scholar]
  28. Du, Y.F.; Zhang, R.Y.; Shi, P.C.; Zhao, L.F.; Zhang, B.; Liu, Y.M. ST-LaneNet: Lane Line Detection Method Based on Swin Transformer and LaneNet. Chin. J. Mech. Eng. 2024, 37, 14. [Google Scholar] [CrossRef]
  29. Liu, Y.Z.; Wu, C.J.; Zeng, Y.T.; Chen, K.Y.; Zhou, S.J. Swin-APT: An Enhancing Swin-Transformer Adaptor for Intelligent Transportation. Appl. Sci. 2023, 13, 13226. [Google Scholar] [CrossRef]
  30. Lin, A.L.; Chen, B.Z.; Xu, J.Y.; Zhang, Z.; Lu, G.M.; Zhang, D. DS-TransUNet: Dual Swin Transformer U-Net for Medical Image Segmentation. IEEE Trans. Instrum. Meas. 2022, 71, 15. [Google Scholar] [CrossRef]
  31. Wei, C.; Ren, S.H.; Guo, K.T.; Hu, H.H.; Liang, J.M. High-Resolution Swin Transformer for Automatic Medical Image Segmentation. Sensors 2023, 23, 3420. [Google Scholar] [CrossRef]
  32. Zhang, L.; Wen, Y.; Soc, I.C. A transformer-based framework for automatic COVID19 diagnosis in chest CTs. In Proceedings of the 18th IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, BC, Canada, 11–17 October 2021; pp. 513–518. [Google Scholar]
  33. Gao, L.F.; Zhang, J.X.; Yang, C.H.; Zhou, Y.C. Cas-VSwin transformer: A variant swin transformer for surface-defect detection. Comput. Ind. 2022, 140, 103689. [Google Scholar] [CrossRef]
  34. Liu, J. Concrete crack segmentation using UNet algorithm with swin transformer block & CPAM. In Proceedings of the 2023 5th International Conference on Robotics, Intelligent Control and Artificial Intelligence (RICAI), Hangzhou, China, 1–3 December 2023; pp. 694–698. [Google Scholar]
  35. Wang, C.; Liu, H.B.; An, X.Y.; Gong, Z.Q.; Deng, F. SwinCrack: Pavement crack detection using convolutional swin-transformer network. Digit. Signal Prog. 2024, 145, 104297. [Google Scholar] [CrossRef]
  36. Cao, H.; Wang, Y.; Chen, J.; Jiang, D.; Zhang, X.; Tian, Q.; Wang, M. Swin-Unet: Unet-like pure transformer for medical image segmentation. In Proceedings of the Computer Vision—ECCV 2022 Workshops: Proceedings, Tel Aviv, Israel, 23–27 October 2022; Lecture Notes in Computer Science (13803). pp. 205–218. [Google Scholar]
  37. Hu, J.; Shen, L.; Albanie, S.; Sun, G.; Wu, E.H. Squeeze-and-Excitation Networks. IEEE Trans. Pattern Anal. Mach. Intell. 2020, 42, 2011–2023. [Google Scholar] [CrossRef]
  38. Zou, Q.; Cao, Y.; Li, Q.Q.; Mao, Q.Z.; Wang, S. CrackTree: Automatic crack detection from pavement images. Pattern Recognit. Lett. 2012, 33, 227–238. [Google Scholar] [CrossRef]
  39. Shi, Y.; Cui, L.M.; Qi, Z.Q.; Meng, F.; Chen, Z.S. Automatic Road Crack Detection Using Random Structured Forests. IEEE Trans. Intell. Transp. Syst. 2016, 17, 3434–3445. [Google Scholar] [CrossRef]
  40. Myeongsuk, P.; Sanghoon, K. Crack detection using fully convolutional network in wall-climbing robot. In Advances in Computer Science and Ubiquitous Computing. CSA-CUTE 2019; Lecture Notes in Electrical Engineering (LNEE 715); Springer: Singapore, 2021; pp. 267–272. [Google Scholar]
  41. Ham, S.; Soohyeon, B.; Kim, H.; Impyeong, L.; Lee, G.-P.; Kim, D. Training a semantic segmentation model for cracks in the concrete lining of tunnel. J. Korean Tunn. Undergr. Space Assoc. 2021, 23, 549–558. [Google Scholar]
  42. Dorafshan, S.; Thomas, R.J.; Maguire, M. Fatigue Crack Detection Using Unmanned Aerial Systems in Fracture Critical Inspection of Steel Bridges. J. Bridge Eng. 2018, 23, 15. [Google Scholar] [CrossRef]
  43. Eisenbach, M.; Stricker, R.; Seichter, D.; Amende, K.; Debes, K.; Sesselmann, M.; Ebersbach, D.; Stoeckert, U.; Gross, H.M. How to get pavement distress detection ready for deep learning? A systematic approach. In Proceedings of the International Joint Conference on Neural Networks (IJCNN), Anchorage, AK, USA, 14–19 May 2017; IEEE: Piscataway, NJ, USA, 2017; pp. 2039–2047. [Google Scholar]
  44. Paszke, A.; Gross, S.; Massa, F.; Lerer, A.; Bradbury, J.; Chanan, G.; Killeen, T.; Lin, Z.M.; Gimelshein, N.; Antiga, L.; et al. PyTorch: An imperative style, high-performance deep learning library. In Proceedings of the 33rd Conference on Neural Information Processing Systems (NeurIPS), Vancouver, BC, Canada, 8–14 December 2019. [Google Scholar]
  45. Zhao, H.S.; Shi, J.P.; Qi, X.J.; Wang, X.G.; Jia, J.Y. Pyramid scene parsing network. In Proceedings of the 30th IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; IEEE: Piscataway, NJ, USA, 2017; pp. 6230–6239. [Google Scholar]
  46. Chen, L.-C.; Papandreou, G.; Schroff, F.; Adam, H. Rethinking atrous convolution for semantic image segmentation. arXiv 2017, arXiv:1706.05587. [Google Scholar]
  47. Guo, M.-H.; Lu, C.-Z.; Hou, Q.; Liu, Z.; Cheng, M.-M.; Hu, S.-M. Segnext: Rethinking convolutional attention design for semantic segmentation. Adv. Neural Inf. Process. Syst. 2022, 35, 1140–1156. [Google Scholar]
  48. Zhang, W.W.; Pang, J.M.; Chen, K.; Loy, C.C. K-Net: Towards unified image segmentation. In Proceedings of the 35th Conference on Neural Information Processing Systems (NeurIPS), Virtual, 6–14 December 2021. [Google Scholar]
  49. Yu, C.Q.; Gao, C.X.; Wang, J.B.; Yu, G.; Shen, C.H.; Sang, N. BiSeNet V2: Bilateral Network with Guided Aggregation for Real-Time Semantic Segmentation. Int. J. Comput. Vis. 2021, 129, 3051–3068. [Google Scholar] [CrossRef]
Figure 1. Visualization of feature maps based on CNN methods and Transformer methods. Conv denotes the convolutional mappings in different layers of the CNN network, while Stage refers to the feature maps corresponding to different stages in the Swin Transformer.
Figure 2. The overall structure of the model.
Figure 3. The architecture of a Swin Transformer (Swin-B). This figure is drawn with reference to Figure 3 in reference [27].
Figure 4. Two successive Swin Transformer Blocks.
Figure 5. (a) Residual Connection and (b) Cross-layer Connection.
Figure 6. The illustration of dilated convolutions.
Figure 7. A Squeeze-and-Excitation block.
Figure 8. Dataset examples of raw RGB images and binary labels.
Figure 9. Visualization comparison of crack segmentation results.
Figure 10. Visualization comparison of crack segmentation results.
Table 1. Performance comparison between the proposed method and various techniques.
Methods          | IoU (%) | MIoU (%) | Precision (%) | Recall (%) | F1-Score (%)
FCN [5]          | 60.37   | 79.51    | 76.52         | 74.10      | 75.29
U-Net [11]       | 51.07   | 74.06    | 65.51         | 69.86      | 67.61
PSPNet [45]      | 59.01   | 78.82    | 78.76         | 70.17      | 74.22
DeepLabv3 [46]   | 59.45   | 79.05    | 78.59         | 70.93      | 74.57
DeepCrack [6]    | 58.73   | 78.68    | 78.48         | 70.01      | 74.00
K-Net [48]       | 61.07   | 79.89    | 79.03         | 72.87      | 75.83
SegFormer [22]   | 60.43   | 79.56    | 79.11         | 71.90      | 75.33
BiSeNet V2 [49]  | 54.78   | 76.60    | 72.71         | 68.96      | 70.79
SegNeXt [47]     | 61.61   | 80.17    | 79.32         | 73.41      | 76.25
Ours             | 63.34   | 81.06    | 79.95         | 75.31      | 77.56
Bold font indicates the best results.
Table 2. Ablation study results of improvement points.
Group | Swin Transformer | SE | DenseASPP | IoU (%) | Recall (%) | F1-Score (%)
1     |                  |    |           | 61.89   | 73.30      | 76.01
2     | ✓                |    |           | 62.92   | 74.56      | 77.24
3     | ✓                | ✓  |           | 63.11   | 74.67      | 77.38
4     | ✓                | ✓  | ✓         | 63.34   | 75.31      | 77.56
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
