1. Introduction
Cracks are among the most common forms of road surface distress and pose a potential threat to highway safety. Regular crack detection plays a vital role in the maintenance and operation of existing buildings and infrastructure. Compared with traditional manual visual inspection, which is tedious, subjective, and time-consuming and exposes inspectors to dangerous working conditions [1], automatic crack detection based on computer vision has attracted wide attention in academia and industry because it is safer, cheaper, more efficient, and more objective.
Automatic crack detection remains a challenging task because of stains, shadows, complex textures, uneven illumination, blurring, and the variety of scenes [2]. In the past decades, scholars have proposed a variety of image-based algorithms to automatically detect cracks on concrete surfaces and pavement. Most early methods were based on combinations or improvements of traditional digital image processing techniques (IPTs) [3], such as thresholding [4,5,6] and edge detection [7,8,9,10]. However, these methods generally rest on the strong assumption that crack pixels are darker than the background and usually continuous, which makes them difficult to use effectively against complex background noise [11,12]. To improve the accuracy and integrity of crack detection, methods based on the wavelet transform [13,14] were proposed to highlight the crack regions. However, due to the anisotropic characteristics of wavelets, they may not deal well with cracks with large curvatures or poor continuity [2].
In recent studies, several minimal path methods [15,16] have also been used for crack detection. Although these methods exploit crack features from a global view [3] and achieve good performance, their main limitations are that seed points for path tracking must be set in advance [17] and that the computational cost is too high for practical application.
To improve the adaptability of IPT-based methods in real environments, researchers have applied machine learning (ML) methods to damage detection, including artificial neural networks (ANNs) [18,19], support vector machines (SVMs) [20,21,22], random structured forests [23], AdaBoost [24], and so on. These methods perform well but rely heavily on manual feature extraction.
More recently, supervised deep learning methods, such as convolutional neural networks (CNNs), have achieved state-of-the-art performance in many advanced computer vision tasks, such as image recognition [25], object detection [26,27], and semantic segmentation [28,29,30]. The main advantage of deep learning is that it does not rely on expert-driven heuristic thresholds or hand-designed features and offers high accuracy and robustness to image variations [31].
Unet [32], a typical representative of semantic segmentation algorithms, has achieved great success in medical image segmentation. There are many similarities between pavement crack detection and medical image segmentation, so it is natural to apply Unet to pavement crack segmentation.
The spatial-channel squeeze and excitation (scSE) [33] attention mechanism can enhance informative features while suppressing uninformative ones along both the spatial and channel dimensions [34], which helps to improve semantic segmentation results.
Inspired by Unet and scSE, this paper proposes a U-shaped encoder–decoder semantic segmentation network for pavement crack detection that combines Unet with ResNet and uses the scSE attention module to enhance the crack detection effect.
The main contributions of this paper can be summarized as follows:
We modified Unet and proposed a residual U-shaped encoder–decoder semantic segmentation network that combines Unet with ResNet18, named RUC-Net, which achieved better detection results than the original Unet and other classical segmentation algorithms, such as FCN [29] and SegNet [30].
We integrated the scSE attention mechanism in RUC-Net. This attention module correlated the global information of cracks, effectively improving the detection effect. In addition, we experimentally compared how different combinations of scSE attention modules in the encoder part (downsampling stage) and the decoder part (upsampling stage) affect detection performance.
We introduced the focal loss function, which could reduce the weight of easy-to-classify samples, to deal with the problem of class imbalance in crack segmentation.
The rest of the paper is organized as follows: Section 2 reviews previous work on pavement crack detection based on deep learning. Section 3 describes the network architecture of our model, the loss function, and the optimization method. Section 4 presents the experimental verification and discusses our method. Section 5 provides ablation studies on the scSE module and the choice of focal loss parameters. Finally, Section 6 summarizes our work and points out its limitations.
3. Proposed Method
Unet was originally designed for biomedical image segmentation, such as cell image segmentation and retinal capillary segmentation. Although biomedical training datasets are generally small, Unet still achieves good segmentation results. Because data acquisition and annotation are costly, crack segmentation datasets are usually small as well. Moreover, the topological structures of crack images and biomedical images share some similarities. In view of these two points, the segmentation tasks for crack images and biomedical images are closely related. Therefore, we adopted a Unet-based network for crack image segmentation.
To further improve the segmentation performance of Unet, we first introduced residual modules in the downsampling path, which improved gradient propagation and helped to improve the generalization ability of the network. Second, we introduced the scSE attention mechanism, which could enhance informative features while suppressing uninformative ones along the spatial and channel dimensions, so as to improve the semantic segmentation results.
3.1. Network Architecture
The network we proposed was a residual U-shaped encoder–decoder semantic segmentation network called the Residual Unet Crack Network (RUC-Net), as shown in Figure 1. The encoder part of RUC-Net was a contracting path to capture contextual semantic information, which was modified from the encoder part of the original Unet combined with ResNet18. For the encoder, we made the following main modifications:
The 7 × 7 convolution layer and the max pooling layer at the front of ResNet18 were removed, and the two 3 × 3 convolution layers at the front of Unet were retained to change the number of channels from 3 to 64.
In the original Unet, the number of channels reached 1024 after four downsamplings. To reduce the model parameters and computational complexity, the final channel number of RUC-Net was 512 after four downsamplings. Accordingly, the number of channels in the proposed network remained 64 after the first downsampling, rather than doubling to 128 as in the original Unet.
The 2 × 2 max pooling layer used for downsampling and the two 3 × 3 convolution layers of the original Unet were replaced by a residual block inspired by ResNet. As shown in Figure 2, each residual block contained two basic blocks. Each basic block contained two 3 × 3 convolutions and a corresponding skip connection. In the first basic block, a 3 × 3 convolution with a stride of two was used for downsampling. A total of four residual blocks were used, and the last three residual blocks were equivalent to conv3_x, conv4_x, and conv5_x in ResNet18. The first residual block, however, used a 3 × 3 convolution with a stride of two for downsampling, unlike conv2_x of the original ResNet18, which had no downsampling. After four downsamplings, the resolution of the feature map was reduced to 1/16 of that of the original image.
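The encoder layout described above can be traced with a short shape calculation. The following is a minimal sketch rather than the authors' code: the function names `conv_out` and `encoder_shapes` and the 512 × 512 input size are illustrative assumptions, and every 3 × 3 convolution is assumed to use a padding of one. The widths of 128 and 256 channels for the middle blocks follow from their stated equivalence to conv3_x and conv4_x of ResNet18.

```python
def conv_out(size, kernel=3, stride=1, padding=1):
    """Spatial output size of a convolution along one dimension."""
    return (size + 2 * padding - kernel) // stride + 1


def encoder_shapes(height, width):
    """Return (stage, channels, height, width) for each encoder stage.

    Stem: two 3x3 convolutions (stride 1) mapping 3 -> 64 channels.
    Then four residual blocks, each downsampling with a stride-2 3x3 conv.
    """
    shapes = []
    h, w = conv_out(height), conv_out(width)  # stride 1 keeps the resolution
    shapes.append(("stem", 64, h, w))
    # Channels stay at 64 after the first downsampling and end at 512.
    for index, channels in enumerate([64, 128, 256, 512], start=1):
        h, w = conv_out(h, stride=2), conv_out(w, stride=2)
        shapes.append((f"res{index}", channels, h, w))
    return shapes


if __name__ == "__main__":
    # Hypothetical 512 x 512 input image.
    for stage in encoder_shapes(512, 512):
        print(stage)
```

For the assumed 512 × 512 input, the last stage has 512 channels at 32 × 32, which is 1/16 of the input resolution along each side.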
The decoder part of RUC-Net was an expanding path, which upsampled the feature map and restored its resolution step by step. The feature map obtained by each upsampling was concatenated, via a skip connection, with the feature map of the same resolution in the downsampling path. This skip connection reused image details that may have been lost in the encoding layers and took into account both global information and localization accuracy, so that the decoding layers could reconstruct image details more effectively [57].
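One decoder step with its skip connection can be illustrated as follows. This is a hedged sketch only: the text does not state which upsampling operator is used, so nearest-neighbour upsampling is assumed here purely for illustration, and the names `upsample2x` and `skip_connect` are hypothetical. Feature maps are stored as NumPy arrays of shape [C, H, W].

```python
import numpy as np


def upsample2x(x):
    """Nearest-neighbour 2x upsampling of a [C, H, W] feature map (assumed operator)."""
    return x.repeat(2, axis=1).repeat(2, axis=2)


def skip_connect(decoder_feat, encoder_feat):
    """Upsample the decoder feature map and concatenate it with the
    encoder feature map of the same resolution along the channel axis."""
    up = upsample2x(decoder_feat)
    if up.shape[1:] != encoder_feat.shape[1:]:
        raise ValueError("spatial sizes of the two feature maps must match")
    return np.concatenate([encoder_feat, up], axis=0)


if __name__ == "__main__":
    deep = np.zeros((512, 2, 2))     # low-resolution decoder input
    shallow = np.ones((256, 4, 4))   # encoder feature map at twice the resolution
    print(skip_connect(deep, shallow).shape)
```

In a trained network the concatenated map would then pass through further convolutions; those layers are omitted here because their configuration is not given in this section.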
3.2. scSE Module
Roy et al. [33] proposed the scSE module, which had three variants: sSE (‘squeezes’ along the channels and ‘excites’ spatially), cSE (‘squeezes’ along the spatial domain and ‘excites’ along the channels), and scSE (concurrent sSE and cSE). Details of their structure can be found in the original article, and their principles are briefly described below.
The sSE module. The original feature map was changed from [C, H, W] to [1, H, W] via a 1 × 1 convolution, then activated by a sigmoid to obtain the spatial attention map, which was applied to the original feature map to recalibrate the spatial information.
The cSE module. The feature map was first reduced from [C, H, W] to [C, 1, 1] by global average pooling and then transformed by two successive 1 × 1 convolutions into a C-dimensional vector. This vector was normalized by a sigmoid and multiplied channelwise with the original feature map to obtain a feature map recalibrated by channel information.
The scSE module. The scSE was the combination of the sSE and cSE modules, which was essentially the parallel connection of the two modules. Specifically, after the feature map was operated through the sSE and cSE modules, we added up the two outputs to recalibrate the feature map both spatially and channelwise.
In this paper, we discuss the influence of various scSE modules and their combinations on crack detection performance in the downsampling and upsampling stages. The details are presented in Section 5.
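The three recalibration paths described above can be sketched in NumPy. This is an illustrative sketch, not the implementation used in the paper: the function and weight names are hypothetical, a 1 × 1 convolution on a [C, 1, 1] tensor is written as a matrix product, and the ReLU between the two 1 × 1 convolutions in cSE together with the reduced hidden width are assumptions, since neither is specified in this section.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def sse(x, w_s, b_s=0.0):
    """Spatial attention: a 1x1 conv maps [C, H, W] to [1, H, W], then sigmoid.

    x: feature map [C, H, W]; w_s: 1x1 conv weights of shape [C].
    """
    attention = sigmoid(np.tensordot(w_s, x, axes=([0], [0])) + b_s)  # [H, W]
    return x * attention[None, :, :]


def cse(x, w1, w2):
    """Channel attention: global average pooling, two 1x1 convs, then sigmoid.

    w1: [C_hidden, C] and w2: [C, C_hidden]; the ReLU in between is an assumption.
    """
    squeezed = x.mean(axis=(1, 2))            # [C]
    hidden = np.maximum(w1 @ squeezed, 0.0)   # [C_hidden]
    scale = sigmoid(w2 @ hidden)              # [C]
    return x * scale[:, None, None]


def scse(x, w_s, w1, w2):
    """Parallel sSE and cSE whose outputs are added, as described in the text."""
    return sse(x, w_s) + cse(x, w1, w2)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    features = rng.normal(size=(8, 16, 16))
    out = scse(features, rng.normal(size=8),
               rng.normal(size=(4, 8)), rng.normal(size=(8, 4)))
    print(out.shape)
```

The output keeps the shape of the input, so under these assumptions the module can be placed after any encoder or decoder stage without changing the rest of the network.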
3.3. Loss Function
The loss function is a core component of deep learning methods; it measures the deviation between the predicted values and the true values of a model and usually serves as the objective function for model optimization. The essence of crack segmentation is to classify each pixel of a pavement image as crack or background. It is worth noting that, compared with the pavement background, crack pixels account for only a small proportion of the whole pavement image. To address this serious class imbalance problem, we chose focal loss [67] as the loss function. Focal loss is a modification of the standard cross-entropy loss. It introduces two penalty factors to reduce the weight of easy-to-classify samples, which makes the model focus more on difficult-to-classify samples during training. The focal loss could be expressed as
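$$FL(p, y) = -\alpha \, y \, (1 - p)^{\gamma} \log(p) - (1 - \alpha)(1 - y) \, p^{\gamma} \log(1 - p),$$
where $p$ is the predicted probability that a pixel belongs to a crack and $y \in \{0, 1\}$ is its ground-truth label. The factors $\alpha$ and $(1 - \alpha)$ were used to control the proportions of positive and negative samples, respectively, with $\alpha \in [0, 1]$. The parameter $\gamma$ is called the focusing parameter, and its value range was $[0, +\infty)$. When $\gamma = 0$, focal loss degenerated into the ($\alpha$-weighted) cross-entropy loss, and the larger $\gamma$ was, the more strongly the easy-to-classify samples were down-weighted.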
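The effect of the two factors can be checked numerically with a small sketch of the binary focal loss. The code below is an illustration under stated assumptions, not the authors' implementation: the names `focal_loss` and `mean_focal_loss` are hypothetical, and the defaults α = 0.25 and γ = 2 are illustrative values only, since the values actually used are discussed in the ablation study rather than in this section.

```python
import math


def focal_loss(p, y, alpha=0.25, gamma=2.0, eps=1e-7):
    """Binary focal loss for one pixel.

    p: predicted crack probability; y: ground-truth label (1 = crack, 0 = background).
    alpha and gamma defaults are illustrative assumptions.
    """
    p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
    if y == 1:
        return -alpha * (1.0 - p) ** gamma * math.log(p)
    return -(1.0 - alpha) * p ** gamma * math.log(1.0 - p)


def mean_focal_loss(probs, labels, alpha=0.25, gamma=2.0):
    """Average focal loss over a flat sequence of pixels."""
    losses = [focal_loss(p, y, alpha, gamma) for p, y in zip(probs, labels)]
    return sum(losses) / len(losses)


if __name__ == "__main__":
    easy = focal_loss(0.9, 1)   # crack pixel predicted confidently
    hard = focal_loss(0.1, 1)   # crack pixel predicted poorly
    print(easy, hard)
```

A confidently correct pixel contributes far less than a misclassified one, and setting γ = 0 removes the modulating factor so that only the α weighting of the cross-entropy remains.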
3.4. Parameter Optimization
To minimize the loss, the Adam optimizer was chosen to iteratively update the model parameters. Adam is essentially RMSprop with momentum: it dynamically adjusts the learning rate of each parameter using estimates of the first and second moments of the gradient. Its advantage is that, after bias correction, the effective step size at each iteration stays within a bounded range, which keeps the parameter updates stable. The update process could be simply represented as follows:
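$$
\begin{aligned}
m_t &= \beta_1 m_{t-1} + (1 - \beta_1)\, g_t, \\
v_t &= \beta_2 v_{t-1} + (1 - \beta_2)\, g_t^2, \\
\hat{m}_t &= \frac{m_t}{1 - \beta_1^t}, \qquad \hat{v}_t = \frac{v_t}{1 - \beta_2^t}, \\
\theta_t &= \theta_{t-1} - \eta \, \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon},
\end{aligned}
$$
where $\beta_1$ and $\beta_2$ represent the exponential decay rates of first-order moment estimation and second-order moment estimation, which are set to 0.9 and 0.99, respectively; $t$ is the index of iterations; $g_t$ is the gradient of the loss with respect to the parameters at iteration $t$; $\eta$ represents the learning rate; $m_t$ and $v_t$ represent exponential moving averages of the first-order and second-order moments of the gradient, respectively; $\hat{m}_t$ and $\hat{v}_t$ are the unbiased values of $m_t$ and $v_t$, respectively; and $\epsilon$ is a small constant that prevents division by zero. $\theta$ represents the network model parameters that need to be updated by learning [59].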
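A single Adam update for one scalar parameter can be sketched as follows. This is a minimal illustration, not the training code of the paper: the function name `adam_step`, the learning rate of 1e-3, and the constant `eps` are assumptions, while the decay rates 0.9 and 0.99 are the values stated above.

```python
import math


def adam_step(theta, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.99, eps=1e-8):
    """Apply one Adam update to a scalar parameter.

    t is the 1-based iteration index; lr and eps are illustrative assumptions.
    Returns the updated parameter and the two moving averages.
    """
    m = beta1 * m + (1.0 - beta1) * grad           # first-moment moving average
    v = beta2 * v + (1.0 - beta2) * grad * grad    # second-moment moving average
    m_hat = m / (1.0 - beta1 ** t)                 # bias correction
    v_hat = v / (1.0 - beta2 ** t)
    theta = theta - lr * m_hat / (math.sqrt(v_hat) + eps)
    return theta, m, v


if __name__ == "__main__":
    # Toy problem: minimise f(theta) = theta^2, whose gradient is 2 * theta.
    theta, m, v = 1.0, 0.0, 0.0
    for t in range(1, 51):
        theta, m, v = adam_step(theta, 2.0 * theta, m, v, t, lr=0.01)
    print(theta)
```

On the first iteration the bias-corrected ratio is close to the sign of the gradient, so the step size is approximately the learning rate regardless of the gradient magnitude, which illustrates the bounded step size mentioned above.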
6. Conclusions
In this paper, RUC-Net was proposed for pixel-level pavement crack segmentation. The architecture of RUC-Net was a U-shaped encoder–decoder network combining Unet and ResNet. The residual block in ResNet was used to replace the two 3 × 3 convolution layers in the encoder of the original Unet, so as to extract more precise crack feature information. In the decoder, RUC-Net combined local information from shallow layers with semantic information from deep layers through concatenation to obtain more refined segmentation results. In addition, we introduced the scSE attention module to enhance informative features while suppressing uninformative ones along the spatial and channel dimensions, so as to further improve the crack segmentation effect. The focal loss function was used to deal with the class imbalance problem in crack segmentation. Our approach achieved an F1 score of 73.92% on the CFD dataset, 72.9% on the Crack500 dataset, and 84.61% on the DeepCrack dataset, outperforming FCN, Unet, and SegNet.
One limitation of this research is that our algorithm still requires every pixel of the ground-truth images to be labeled manually, which makes data acquisition expensive. Adopting unsupervised learning-based techniques is one research direction that could mitigate this issue. Because a supervised learning algorithm fits a function that approximates the given labeled training data, its actual performance depends largely on the size and quality of the training dataset. Therefore, establishing a broader, larger, and higher-quality dataset and fully investigating data augmentation techniques are also directions for our future work.