PSNet: Parallel-Convolution-Based U-Net for Crack Detection with Self-Gated Attention Block

Zhang, Xiaohu; Huang, Haifeng

doi:10.3390/app13179875


by Xiaohu Zhang and Haifeng Huang *

School of Electronics and Communication Engineering, Sun Yat-sen University, Guangzhou 510275, China

* Author to whom correspondence should be addressed.

Appl. Sci. 2023, 13(17), 9875; https://doi.org/10.3390/app13179875

Submission received: 2 April 2023 / Revised: 25 April 2023 / Accepted: 26 April 2023 / Published: 31 August 2023


Abstract

Crack detection is an important task in road maintenance. Convolutional neural-network-based segmentation models with attention blocks have achieved promising results because they are less affected by interference from light and shadow. However, these models usually rely on down-sampling operations to extract high-level features, which reduces the resolution of the features and causes feature information loss. We therefore designed a Parallel Convolution Module (PCM) to avoid the feature information loss caused by down-sampling. In addition, the attention blocks in these models focus only on selecting channel features or spatial features, without controlling the feature information flow. To address this, we used a Self-Gated Attention Block (SGAB) to control the feature information flow within the attention block. Combining these two ideas, we propose PSNet, a Parallel-Convolution-based U-Net with a PCM and an SGAB. Because few public datasets allow a detailed evaluation of our method, we also collected a large dataset ourselves, which we named the OAD_CRACK dataset. Compared with state-of-the-art crack detection methods, PSNet demonstrated competitive segmentation performance. The experimental results showed that PSNet achieved accuracies of 92.6%, 81.2%, 98.5%, and 76.2% on the Cracktree200, CRACK500, CFD, and OAD_CRACK datasets, respectively, which were 2.6%, 4.2%, 1.2%, and 3.3% higher than those of the traditional attention models.

Keywords:

convolutional neural network; image segmentation; crack detection; U-Net; crack segmentation

1. Introduction

The increase in traffic volume has caused cracks of varying severity on roads, which can lead to traffic accidents. At present, road cracks are usually identified manually, which is slow, risky, and costly. With the development of image processing, automatically processing road crack images with purpose-built models could effectively solve these problems.

At present, there are many algorithms for pavement crack image detection, such as the algorithm based on multi-level filtering proposed by Talab [1], the crack detection algorithm based on wavelet transform proposed by Zhong [2], and the algorithm based on a support vector machine (SVM) proposed by Marques [3]. Although these methods are more efficient than manual identification, they are sensitive to light and shadow. With the development of deep learning, image recognition using convolutional neural networks (CNNs) has been applied in various fields. For example, Attard [4] proposed a crack detection method based on the R-CNN model, which detects both the locations and the classes of cracks. Although it achieved better results than methods without deep learning, this model could not detect faint cracks and had a very high computational complexity. To address these problems, Nie [5] adopted a one-stage object detection model and designed a pavement crack detection method based on YOLO v3 [6].

However, some application scenarios, such as pavement damage area calculation, require the proportion of the damaged area to be computed, which detection boxes alone cannot provide. Several segmentation methods have been proposed for this purpose. Qiao [7] enhanced crack datasets and improved the U-Net model (a symmetrical U-shaped image-segmentation network) to design a segmentation model in which every pixel of the image is tagged as crack or non-crack. Dai [8] proposed a model based on transfer learning, in which general images were used to learn general features and crack images were used for fine-tuning to learn crack features. Moreover, Qu [6] used feature pyramids for pavement crack detection; in this model, the deep semantic information was integrated into the low-level convolution stages, layer by layer, through a new multi-scale feature fusion module. However, these methods had an evident drawback: light and shadow interference could easily be recognized as cracks, which led to deviations in the prediction results.

To alleviate the interference of light and shadow, researchers introduced attention mechanisms. For example, Li [9] proposed a combination of channel attention mechanisms and fully convolutional neural networks. Xiang [10] proposed a pavement crack detection method based on an end-to-end trainable deep convolutional neural network, built with an encoder–decoder architecture and a pyramid attention module that exploits global context information for the complex topological structures of cracks.

However, the models above had two evident drawbacks:

(1): These models usually used down-sampling operations to extract high-level features and then recovered the resolution of these features via up-sampling operations to obtain the final output. However, the down-sampling operations reduce the resolution of the features and cause feature information loss.
(2): These attention-based models usually used serial structures; that is, they mainly focused on selecting channel features or spatial features without controlling the feature information flow. Thus, crack-related features could not be filtered effectively by these attention-based models.

In this paper, we design a Parallel Convolution Module (PCM) to avoid the feature information loss caused by down-sampling, so that more low-level features can be used for the final classification. In addition, we use a Self-Gated Attention Block (SGAB) to control the feature information flow, so that features related to cracks are permitted to flow to the top layers while features unrelated to cracks are suppressed. Finally, we propose a Parallel-Convolution-based U-Net (PSNet) with a PCM and an SGAB, and design experiments to verify the effectiveness of the proposed model.

The structure of this article is as follows. Section 2 introduces the datasets used in our experiments and the structure of the proposed PSNet, including the PCM and the SGAB, and describes the data augmentation strategies and the Swish activation function. Section 3 presents the experiments and the analysis of their results. Section 4 concludes the article.

2. Materials and Methods

2.1. Datasets

We evaluated our proposed method on four datasets: the Cracktree200 [11], CFD [12], Crack500 [13], and OAD_CRACK datasets. Detailed descriptions of these four datasets are as follows:

The Cracktree200 dataset: This dataset was mainly collected on road pavements. It is a visible-light dataset containing various kinds of cracks under complex interference such as shadows, occlusion, low contrast, and noise. It contains 206 crack images, sized 800 × 600 pixels. For the Cracktree200 dataset, 80% of the images were used for training, and 20% of the images were used for testing.

The CFD dataset: This dataset consists of 118 images, sized 480 × 320 pixels. Each image has manually labeled crack contours. The device used to acquire the images was an iPhone 5 with a focal length of 4 mm, an aperture of f/2.4, and an exposure time of 1/135 s. For the CFD dataset, 80% of the images were used for training and 20% of the images were used for testing.

The Crack500 dataset: This pavement crack dataset includes 3368 images captured with a cell phone on the main roads of Temple University, USA, sized either 1440 × 2560 or 2560 × 1440 pixels. The dataset was divided into training and testing sets by its authors, with 1896 images in the training set and 1124 images in the testing set; the remaining 348 of the 3368 images belong to neither of these two splits.

The OAD_CRACK dataset: As there were few crack images in the public datasets above, we collected a dataset ourselves, which we named the OAD_CRACK dataset. This dataset has 5000 images, each with a resolution of 1920 × 1080 pixels. All images were taken by us with a Huawei P30. Some sample images of the OAD_CRACK dataset are shown in Figure 1.

The images in this dataset were taken in Shenzhen and were divided into four classes: linear crack, circular crack, void, and background. All images were manually annotated with EISeg (an image segmentation labeling tool provided by Baidu), and the four classes were labeled as 0, 1, 2, and 3, respectively. To prevent over-fitting during training, the images were transformed with light and shadow variations to create more varied training data. Through this transformation, the whole dataset was extended to 30,000 images. Finally, we divided the whole dataset into two parts: 70% of the images were used for training and 30% of the images were used for testing.

2.2. Model Structure

The structure of our proposed Parallel-Convolution-based U-Net (PSNet) is shown in Figure 2. The PSNet has two parts: the encoder and the decoder.

As shown in Figure 2, the PSNet is a fully convolutional neural network with a symmetric U-shaped structure, with the encoder on the left side and the decoder on the right side. The encoder contains several Parallel Convolution Modules (PCMs) and several Self-Gated Attention Blocks (SGABs). The main purpose of the PCM is to avoid the feature information loss caused by down-sampling, and the main purpose of the SGAB is to control the flow of features, that is, to highlight features related to cracks and to decrease the weight of features unrelated to cracks. The SGAB selects features in two dimensions: the channel dimension and the spatial dimension. Moreover, traditional convolution layers are used in the encoder to connect the PCM and the SGAB while extracting abstract high-level features. In the decoder, several convolution layers and deconvolution layers decode the features generated by the encoder in order to produce the label map.

2.3. The Parallel Convolution Module

Existing convolution-based segmentation models usually use down-sampling operations to reduce the resolution of features and then recover the resolution by up-sampling to obtain the final output. These models are linear in the sense that they rely on a serial, symmetric encoding–decoding structure. However, such structures cannot avoid the feature information loss caused by down-sampling.

To solve this problem, we propose a Parallel Convolution Module (PCM), whose structure is shown in Figure 3. First, the input features are fed into block 1 for initial feature extraction, and the output of block 1 is fed into block 2. Block 2 extracts features at two resolutions, because the strides of the first convolution layer in its two branches are set to 1 and 2, respectively. Thus, feature maps at the original resolution (1/1) and at 1/2 resolution are generated. These two kinds of feature maps are then fed into block 3. By setting the strides of the first convolution layer in its three branches to 1, 2, and 2, respectively, block 3 generates feature maps with resolutions of 1/1, 1/2, and 1/4. Finally, these three kinds of feature maps are fed into block 4. By setting the strides of the first convolution layer in its four branches to 1, 2, 2, and 2, respectively, block 4 generates feature maps with resolutions of 1/1, 1/2, 1/4, and 1/8. Therefore, feature maps with different resolutions are obtained through the continuous fusion across the four blocks, which avoids the feature information loss caused by down-sampling. In addition, the feature maps generated by block 4 are up-sampled to the same size and added together before being passed to the next layer.
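The resolution bookkeeping described above can be sketched in a few lines of Python. This is only an illustration under an assumption that the paper does not state explicitly: each branch is taken to be fed by the scale of the previous branch, so the down-sampling factors multiply along the branches. The names `BLOCK_STRIDES`, `branch_resolutions`, and `upsample_factors` are hypothetical.

```python
from fractions import Fraction
from itertools import accumulate
from operator import mul

# Stride of the first convolution in each branch, as listed in the text.
BLOCK_STRIDES = {
    2: (1, 2),
    3: (1, 2, 2),
    4: (1, 2, 2, 2),
}

def branch_resolutions(strides):
    """Relative resolution of each branch output.

    Assumption (not stated in the paper): branch k starts from the scale of
    branch k-1, so the per-branch factors 1/stride are multiplied cumulatively.
    """
    return list(accumulate((Fraction(1, s) for s in strides), mul))

def upsample_factors(resolutions):
    """Integer factor that brings each branch back to the finest resolution."""
    finest = max(resolutions)
    return [int(finest / r) for r in resolutions]

for block, strides in BLOCK_STRIDES.items():
    res = branch_resolutions(strides)
    print(f"block {block}: resolutions {[str(r) for r in res]}, "
          f"up-sampling factors {upsample_factors(res)}")
```

Under this reading, the four outputs of block 4 need up-sampling factors of 1, 2, 4, and 8 before they can be added together.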

2.4. The Self-Gated Attention Block

Gating mechanisms have been successfully deployed in some recurrent neural network architectures because they can control which features are passed on. Figure 4 shows the structure of the Self-Gated Attention Block (SGAB). From Figure 4, we can see that the SGAB can be divided into three units. The first unit is the Channel Attention Gate Unit, which rearranges the features of different channels and selects the channel features related to cracks through a channel gate. The second unit is the Spatial Attention Gate Unit, which rearranges spatial features and selects the spatial features related to cracks through a spatial gate. Both the channel gate and the spatial gate are based on our proposed Gate Convolution Filter. The third unit is the Fusing Unit, which combines the features generated by the first and second units. Detailed descriptions are as follows:

2.4.1. Channel Attention Unit

Firstly, we define the input feature map as F. To extract global features, a Global Max Pooling operation (GMP) is used to calculate a global feature C1 with a size of 1 × 1 × C (where C represents the number of channels). Then, C1 is input into a Gate Convolution Filter (GCF) to generate a weight map D1. A detailed description of the GCF is given in Section 2.4.4. In addition, a Global Average Pooling operation (GAP) is used to calculate a global feature C2 with a size of 1 × 1 × C, and C2 is input into another GCF to generate a weight map D2. The purpose of the GCF is to control the feature information flowing into the following layers. After that, D1 and D2 are added together to generate a combined channel attention weight map E1. Finally, the input feature map F is convolved with E1 to generate the final output G1. Through this step, the weights of the channel features related to cracks are increased and the weights of the channel features unrelated to cracks are decreased. The whole process of the Channel Attention Unit can be calculated as:

Z(F) = F \otimes [\mathrm{GCF}(\mathrm{GAP}(F)) \oplus \mathrm{GCF}(\mathrm{GMP}(F))]

(1)

where Z(F) refers to the process of the Channel Attention Unit, F refers to the input feature map, GCF refers to the Gate Convolution Filter, GAP refers to the Global Average Pooling operation, GMP refers to the Global Max Pooling operation, \otimes refers to the convolution operation, and \oplus refers to the add operation.
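A minimal pure-Python sketch of Equation (1) may make the data flow easier to follow. Several things here are assumptions rather than details from the paper: the GCF is reduced to a scalar gate with fixed, hypothetical weights `w`, `b`, and `v` (in the paper they are trained), and \otimes is interpreted as scaling every channel by its 1 × 1 weight.

```python
import math

def gate(x, w=1.0, b=0.0, v=1.0):
    """Scalar stand-in for the Gate Convolution Filter: (x*w + b) * tanh(x*v).

    w, b and v are hypothetical fixed values; a trained model would learn them.
    """
    return (x * w + b) * math.tanh(x * v)

def channel_attention(feature):
    """feature: list of C channels, each a list of H rows of W floats."""
    out = []
    for channel in feature:
        values = [p for row in channel for p in row]
        gap = sum(values) / len(values)   # Global Average Pooling: one value per channel
        gmp = max(values)                 # Global Max Pooling: one value per channel
        weight = gate(gap) + gate(gmp)    # GCF(GAP(F)) plus GCF(GMP(F))
        # Assumed reading of the final step: scale the whole channel by its weight.
        out.append([[p * weight for p in row] for row in channel])
    return out

F = [
    [[1.0, 1.0], [1.0, 1.0]],   # channel with a strong response
    [[0.0, 0.0], [0.0, 0.0]],   # silent channel
]
G1 = channel_attention(F)
print(G1)
```

In this toy input the active channel is amplified while the silent channel stays at zero, which is the re-weighting behaviour the unit is meant to provide.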

2.4.2. Spatial Attention Unit

The Spatial Attention Unit runs in parallel with the Channel Attention Unit. Firstly, the features F are input into the Max Pooling layer (MP) and the Average Pooling layer (AP), respectively, to generate two feature maps F1 and F2. Then, these two feature maps are input into the Gate Convolution Filter (GCF) to generate two spatial attention maps T1 and T2. After that, T1 and T2 are added together to generate a combined spatial attention weight map E2. Finally, the input feature map F is convolved with E2 to generate the final output G2. Through this step, regions associated with cracks are given higher weights, and the remaining regions are given lower weights. The whole process of the Spatial Attention Unit can be calculated as:

K(F) = F \otimes [\mathrm{GCF}(\mathrm{MP}(F)) \oplus \mathrm{GCF}(\mathrm{AP}(F))]

(2)

where K(F) refers to the process of the Spatial Attention Unit, F refers to the input feature map, GCF refers to the Gate Convolution Filter, AP refers to the Average Pooling operation, MP refers to the Max Pooling operation, \otimes refers to the convolution operation, and \oplus refers to the add operation.

2.4.3. Fusing

In the Fusing step, the feature maps generated by the Channel Attention Unit and the Spatial Attention Unit are added together to generate the final output feature map U. As a result, U reflects control of the feature flow in both the channel dimension and the spatial dimension.
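The spatial branch of Equation (2) and the fusing step can be sketched in the same style. The paper does not say along which axis MP and AP pool; the sketch assumes pooling across the channel axis, which yields one H × W map per pooling operation. The scalar `gate` with fixed weights and the placeholder channel-branch output `G1` are likewise hypothetical.

```python
import math

def gate(x, w=1.0, b=0.0, v=1.0):
    """Scalar stand-in for the Gate Convolution Filter (hypothetical fixed weights)."""
    return (x * w + b) * math.tanh(x * v)

def spatial_attention(feature):
    """feature: list of C channels, each a list of H rows of W floats."""
    C, H, W = len(feature), len(feature[0]), len(feature[0][0])
    weights = [[0.0] * W for _ in range(H)]
    for i in range(H):
        for j in range(W):
            column = [feature[c][i][j] for c in range(C)]
            mp = max(column)        # assumed: max pooling across channels
            ap = sum(column) / C    # assumed: average pooling across channels
            weights[i][j] = gate(mp) + gate(ap)
    # Assumed reading of the final step: scale every pixel by its spatial weight.
    return [[[feature[c][i][j] * weights[i][j] for j in range(W)]
             for i in range(H)] for c in range(C)]

def fuse(g1, g2):
    """Element-wise sum of the channel-branch and spatial-branch outputs."""
    return [[[a + b for a, b in zip(r1, r2)] for r1, r2 in zip(c1, c2)]
            for c1, c2 in zip(g1, g2)]

F = [[[1.0, 0.0]], [[1.0, 0.0]]]    # 2 channels, 1 row, 2 columns
G1 = [[[0.5, 0.5]], [[0.5, 0.5]]]   # hypothetical channel-branch output
G2 = spatial_attention(F)
U = fuse(G1, G2)
print(U)
```

Only the first pixel, where both channels respond, receives a non-zero spatial weight, so the fused map U differs from G1 only at that location.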

2.4.4. The Gate Convolution Filter

The Gate Convolution Filter (GCF) differs from an ordinary convolution filter in that it is divided into two parts. The first part is the traditional convolution operation, which is defined as:

X_{l} = X_{l - 1} * W + b

(3)

where X_{l - 1} is the input feature map of layer l, W is the convolution weight filter, and b is the bias. The second part, which a traditional convolution does not have, is the gating part, defined as:

X_{l} = \tanh (X_{l - 1} * V)

(4)

where X_{l - 1} is the input feature map of layer l, and V is a trainable weight. The tanh function is used here to suppress (forget) unimportant features. The features generated by Equation (3) are then convolved with the gate values from Equation (4). Through this gating, features related to cracks are permitted to flow to the top layers and features unrelated to cracks are suppressed, which effectively controls the feature flow.
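The two parts of the GCF can be illustrated on a one-dimensional signal. This is a sketch under stated assumptions, not the authors' implementation: the way Equations (3) and (4) are combined is taken to be an element-wise product, as is common in gated convolutions, and the kernels `W` and `V` are hypothetical fixed values rather than trained weights.

```python
import math

def conv1d(x, kernel, bias=0.0):
    """'Valid' 1-D cross-correlation, which is what deep-learning layers compute."""
    k = len(kernel)
    return [sum(x[i + t] * kernel[t] for t in range(k)) + bias
            for i in range(len(x) - k + 1)]

def gate_convolution_filter(x, W, b, V):
    features = conv1d(x, W, b)                      # Equation (3): X * W + b
    gates = [math.tanh(g) for g in conv1d(x, V)]    # Equation (4): tanh(X * V)
    # Assumed combination: element-wise product of features and gate values.
    return [f * g for f, g in zip(features, gates)]

x = [0.0, 0.0, 1.0, 1.0, 0.0, 0.0]   # a short "crack-like" bump
out = gate_convolution_filter(x, W=[1.0, 1.0], b=0.0, V=[1.0, 1.0])
print([round(v, 4) for v in out])
```

Because tanh lies strictly between −1 and 1, each gate value scales its feature response by less than one in magnitude, and positions where the gate input is zero are removed entirely.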

2.5. Data Augmentation

Data augmentation is widely used when training deep learning models. The simple transformations typically applied to images are geometric, such as flipping, rotation, translation, cropping, and scaling. Here, we used rotation and cropping as our data augmentation strategies.

Angle Rotation: Angle rotation augments the training datasets by rotating images to different angles. Because the appearance of cracks does not depend on their orientation, we rotated all images by 10, 30, 60, 90, 110, 140, and 170 degrees, respectively, to generate new images.

Image Cropping: Image cropping enlarges the training datasets by cropping images. Because cracks do not have regular shapes, we chose to randomly crop and recombine all images of the training datasets.
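The rotation strategy can be sketched without any imaging library. The function below is a hypothetical nearest-neighbour rotation about the image centre; the paper does not say which interpolation or border handling was used, so both are assumptions, as is the decision to fill uncovered corners with 0.

```python
import math

ANGLES = (10, 30, 60, 90, 110, 140, 170)   # degrees, as listed in the text

def rotate_nearest(img, degrees, fill=0):
    """Rotate a 2-D list about its centre with nearest-neighbour sampling."""
    h, w = len(img), len(img[0])
    cy, cx = (h - 1) / 2.0, (w - 1) / 2.0
    t = math.radians(degrees)
    cos_t, sin_t = math.cos(t), math.sin(t)
    out = [[fill] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            # Inverse mapping: locate the source pixel that lands on (y, x).
            sx = cos_t * (x - cx) + sin_t * (y - cy) + cx
            sy = -sin_t * (x - cx) + cos_t * (y - cy) + cy
            iy, ix = round(sy), round(sx)
            if 0 <= iy < h and 0 <= ix < w:
                out[y][x] = img[iy][ix]
    return out

def augment(image, mask):
    """Apply every listed angle to an image and its label mask together."""
    return [(rotate_nearest(image, a), rotate_nearest(mask, a)) for a in ANGLES]

vertical_crack = [[0, 1, 0],
                  [0, 1, 0],
                  [0, 1, 0]]
pairs = augment(vertical_crack, vertical_crack)
print(rotate_nearest(vertical_crack, 90))
```

The label mask has to receive exactly the same rotation as the image, and nearest-neighbour sampling keeps the class indices integral instead of blending them.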

2.6. Swish Activation Function

The most widely used activation function is the Rectified Linear Unit (ReLU), which is defined by ReLU(x) = max(0, x). This design not only avoids vanishing gradients, but also drops negative features and decreases the computational complexity. However, these negative features might contain useful information, so dropping them leads to information loss. The ReLU activation function is shown in Figure 5a.

To solve this problem, Google proposed the Swish [14] activation function. The Swish activation function is unbounded above, bounded below, smooth, and non-monotonic. In the Swish activation function, all negative inputs are compressed to a small output rather than set to zero, so negative features can be retained. The Swish activation function can be calculated as:

\mathrm{Swish}(x) = x \cdot \mathrm{sigmoid}(x)

(5)

where x is the input feature, and Swish(x) is the Swish activation function.
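The difference between the two activations on negative inputs is easy to check numerically. The following short sketch implements ReLU and Equation (5) with the standard library only; the sample inputs are arbitrary.

```python
import math

def relu(x):
    return max(0.0, x)

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def swish(x):
    """Equation (5): x multiplied by sigmoid(x)."""
    return x * sigmoid(x)

for x in (-4.0, -1.0, 0.0, 1.0, 4.0):
    print(f"x={x:+.1f}  relu={relu(x):.4f}  swish={swish(x):+.4f}")
```

ReLU maps every negative input to zero, whereas Swish returns a small negative value for the same inputs, so the sign information survives into the next layer.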

3. Results and Discussion

3.1. Experimental Setup

In our experiments, the training data were first normalized and then input into the PSNet. In the encoder of the PSNet, the kernel size of all convolution layers was set to 3 × 3, and the numbers of convolution filters were 64, 64, 128, 128, 256, 256, 512, 512, and 1024, respectively. In the decoder, the kernel size of every convolution layer was also set to 3 × 3, and the numbers of convolution filters were 512, 512, 256, 256, 128, 128, 64, and 64, respectively. In the PCM, the kernel size of the convolution layers was 3 × 3, and the number of convolution filters was 64. We used a Stochastic Gradient Descent (SGD) optimizer to train the PSNet.

We used the accuracy as the evaluation criterion for our experiment. The definition of accuracy is the number of correct samples divided by the number of samples in the test datasets. The formula can be expressed as follows:

\mathrm{Acc} = \frac{TP + TN}{S_{All}}

(6)

where Acc represents the accuracy criterion, TP represents the number of positive samples correctly identified, TN represents the number of negative samples correctly identified, and S_{All} represents the number of samples in the test datasets.

3.2. Comparison with the State-of-the-Art Methods

To evaluate the performance of our proposed PSNet, we conducted a comparison experiment, using image segmentation accuracy as the main indicator. To demonstrate the effectiveness of our method, several traditional machine-learning-based models and attention-based convolution models were used as baseline algorithms.

Before the comparison, we briefly describe these mainstream algorithms:

FCN: a fully convolutional neural network;
ConvNet: a deep convolutional neural network;
Split-Attention Network: a channel-wise attention-based network;
Cascaded Attention DenseUNet: an attention-based network with global attention and core attention;
ECA-Net: a lightweight channel attention-based convolutional neural network;
DWTA-UNet: a U-Net-based network with discrete wavelet transformed image features.

As shown in Table 1, compared with the mainstream attention-based convolution networks (DWTA-UNet, ECA-Net, Cascaded Attention DenseUNet, and Split-Attention Network), our proposed PSNet achieves better segmentation accuracy on all datasets. This supports the effectiveness of the two components of our method, the Parallel Convolution Module (PCM) and the Self-Gated Attention Block (SGAB). The reason is that features at different resolutions are maintained through the continuous fusion of the four blocks of the PCM, which avoids the information loss caused by down-sampling in the CNN model. Additionally, the SGAB controls the feature flow in two dimensions, selecting features related to cracks and suppressing unrelated features. Compared with the U-Net model, attention-based models such as ECA-Net, Cascaded Attention DenseUNet, Split-Attention Network, and DWTA-UNet achieve better performance, since these models use attention blocks to decrease the interference of shadow and light. Compared with fully convolutional neural networks such as FCN and ConvNet, the U-Net model achieves a higher performance, since it not only has more layers but also has a feature fusion policy that yields more discriminative features. Compared with SVM and CrackForest, FCN achieves a better result, because deep learning models can automatically extract high-level features, which generalize better than handcrafted features.

Here, we present some examples of our PSNet’s detection results for crack images taken from the CRACK500 dataset. These crack images and their detection results are shown in Figure 6.

3.3. Effects of Using Different Activation Functions

To evaluate the effect of using different activation functions, we conducted an experiment on three public datasets. We compared Sigmoid, Tanh, Rectified Linear Unit (ReLU), Parametric Rectified Linear Unit (PReLU), and Swish; the results are shown in Table 2.

From the experimental results, it can be seen that the Swish activation function achieves the best result. This is because the Swish activation function compresses the negative feature information and retains these features in the next layer.

Additionally, the Swish activation function achieves better results than the PReLU activation function. This is because the Swish activation function is smooth, which allows the output features to change continuously with the input features; therefore, it does not change the distribution of the input features abruptly.

Compared with the ReLU activation function, PReLU achieves a better performance because it uses a learned linear function to compress negative features, which avoids the loss of negative feature information. It does so with little overfitting risk, since it adds only a learnable slope for the negative part of the input. In contrast, the ReLU activation function outputs 0 for all negative features, resulting in the loss of all negative features.

Compared with Tanh and Sigmoid, ReLU achieves a higher performance since it avoids the problem of vanishing gradients [14].

3.4. Effects of Using Different Loss Functions

For the purpose of evaluating the effect of using different loss functions, we conducted an experiment.

As shown in Table 3, the Weighted Cross Entropy Loss Function and the Focal Loss Function achieve a much higher performance than the Mean Square Error Loss Function and the Cross Entropy Loss Function. The reason is the category imbalance in crack segmentation tasks: crack pixels are far fewer than non-crack pixels. Since these two loss functions can down-weight the majority category (non-crack pixels), they make the network focus on training the minority category (crack pixels). Compared with the Weighted Cross Entropy Loss Function, the Focal Loss Function achieves a slightly better result, because it uses more adjustable parameters to compensate for the category imbalance.
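The re-weighting effect of the two losses can be shown per pixel for a binary crack/background problem. The paper does not report the weights it used, so `w_crack` and `w_background` are hypothetical values; `alpha=0.25` and `gamma=2.0` are the commonly used focal loss defaults, not values taken from this study.

```python
import math

def cross_entropy(p, y):
    """Binary cross entropy for one pixel; p is the predicted crack probability."""
    p_t = p if y == 1 else 1.0 - p
    return -math.log(p_t)

def weighted_cross_entropy(p, y, w_crack=10.0, w_background=1.0):
    """Cross entropy with a fixed weight per class (weights are hypothetical)."""
    w = w_crack if y == 1 else w_background
    return w * cross_entropy(p, y)

def focal_loss(p, y, alpha=0.25, gamma=2.0):
    """Focal loss: -alpha_t * (1 - p_t)**gamma * log(p_t)."""
    p_t = p if y == 1 else 1.0 - p
    alpha_t = alpha if y == 1 else 1.0 - alpha
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)

easy_background = (0.05, 0)   # background pixel, confidently predicted as background
hard_crack = (0.30, 1)        # crack pixel, predicted with low confidence

for name, loss in (("CE", cross_entropy),
                   ("weighted CE", weighted_cross_entropy),
                   ("focal", focal_loss)):
    print(f"{name:12s} easy={loss(*easy_background):.5f}  hard={loss(*hard_crack):.5f}")
```

Both variants increase the share of the total loss that comes from the hard crack pixel: weighted cross entropy does it with one fixed factor per class, while the focal loss additionally shrinks the contribution of every pixel that is already classified confidently.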

3.5. Effects of Using a Different Number of Blocks in the Parallel Convolution Module

For the purpose of evaluating the effect of using a different number of blocks in the Parallel Convolution Module, we conducted an experiment.

As shown in Table 4, the number of blocks used in the PCM affects the final accuracy of the model. From the experimental results, we make the following observations. Firstly, the number of blocks in the PCM has different impacts on the four datasets, with a range of variation of about 3%. Secondly, as the number of blocks in the PCM increases, the final segmentation accuracy first rises, reaches a maximum, and then drops. The main reason is that adding blocks allows more levels of features to be fused; however, if the number of blocks keeps rising, the extra blocks may introduce features that are useless for the final classification and that interfere with the final segmentation.

3.6. Comparison of Different Attention Mechanisms

For the purpose of evaluating the effect of using different attention mechanisms in the SGAB, we conducted an experiment.

As shown in Table 5, the choice of attention mechanism in the SGAB affects the final accuracy of the model. From the experimental results, we make the following observations:

(1): The accuracy of using only Channel Attention or only Spatial Attention in the SGAB is lower than the accuracy of the same modules with Gate Convolution Filters (GCFs) inside. The reason is that the GCF allows features related to cracks to flow to the top layers while inhibiting features unrelated to cracks, which effectively controls the feature flow.
(2): Using both Channel Attention and Spatial Attention with a GCF in the SGAB achieves the best performance, because this mixed attention not only rearranges the features of different channels but also selects the channel features related to cracks through the channel gates. Moreover, the spatial features related to cracks are selected through the spatial gates. Blended together, these two kinds of features provide more feature information for the top layers.

4. Conclusions

In this paper, we propose PSNet, a crack detection model. In the PSNet, a Parallel Convolution Module (PCM) is designed to avoid the feature information loss caused by down-sampling, so that more low-level features can be used for the final classification. In addition, a Self-Gated Attention Block (SGAB) is used to control the feature information flow, so that features related to cracks are permitted to flow to the top layers and features unrelated to cracks are suppressed. Experimental results show that our PSNet model achieved accuracies of 92.6%, 81.2%, 98.5%, and 76.2% on the Cracktree200, CRACK500, CFD, and OAD_CRACK datasets, respectively, which are 2.6%, 4.2%, 1.2%, and 3.3% higher than those of the traditional attention model. Our work could help with detecting small edge cracks in the field of crack detection. However, the model size of the PSNet is still very large, and our future research will focus on reducing it.

Author Contributions

Conceptualization, X.Z. and H.H.; methodology, X.Z.; software, X.Z.; validation, X.Z. and H.H.; formal analysis, X.Z. and H.H.; investigation, X.Z. and H.H.; resources, H.H.; data curation, H.H.; writing—original draft preparation, X.Z. and H.H.; writing—review and editing, H.H.; visualization, H.H.; supervision, H.H.; project administration, H.H.; funding acquisition, H.H. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data presented in this study are available on request from the corresponding author. The data are not publicly available due to business confidentiality needs.

Conflicts of Interest

The authors declare no conflict of interest.

References

1. Talab, A.M.A.; Huang, Z.; Xi, F.; HaiMing, L. Detection crack in image using Otsu method and multiple filtering in image processing techniques. Optik 2016, 127, 1030–1033.
2. Zhong, S.; Oyadiji, S.O. Detection of cracks in simply-supported beams by continuous wavelet transform of reconstructed modal data. Comput. Struct. 2011, 89, 127–148.
3. Marques, A.; Correia, P.L. Automatic Road Pavement Crack Detection Using SVM. Master’s Thesis, Instituto Superior Técnico, Lisbon, Portugal, 2012.
4. Attard, L.; Debono, C.J.; Valentino, G.; Di Castro, M.; Masi, A.; Scibile, L. Automatic crack detection using mask R-CNN. In Proceedings of the 2019 11th International Symposium on Image and Signal Processing and Analysis (ISPA), Dubrovnik, Croatia, 23–25 September 2019; IEEE: Piscataway, NJ, USA, 2019; pp. 152–157.
5. Qiao, W.; Zhang, H.; Zhu, F.; Wu, Q. A crack identification method for concrete structures using improved U-net convolutional neural networks. Math. Probl. Eng. 2021, 2021, 6654996.
6. Nie, M.; Wang, C. Pavement Crack Detection based on yolo v3. In Proceedings of the 2019 2nd International Conference on Safety Produce Informatization (IICSPI), Chongqing, China, 28–30 November 2019; IEEE: Piscataway, NJ, USA, 2019; pp. 327–330.
7. Dai, B.; Gu, C.; Zhao, E.; Zhu, K.; Cao, W.; Qin, X. Improved online sequential extreme learning machine for identifying crack behavior in concrete dam. Adv. Struct. Eng. 2019, 22, 402–412.
8. Qu, Z.; Dong, X.Y. Method of feature pyramid and attention enhancement network for pavement crack detection. J. Electron. Imaging 2022, 31, 033019.
9. Li, R.; Xu, K.; Wu, D.; Zhu, Z. Pixel-level crack detection using an attention mechanism. In Proceedings of the 2021 6th International Conference on Intelligent Computing and Signal Processing (ICSP), Xi’an, China, 9–11 April 2021; IEEE: Piscataway, NJ, USA, 2021; pp. 1114–1117.
10. Xiang, X.; Zhang, Y.; El Saddik, A. Pavement crack detection network based on pyramid structure and attention mechanism. IET Image Process. 2020, 14, 1580–1586.
11. Qu, Z.; Mei, J.; Liu, L.; Zhou, D.Y. Crack detection of concrete pavement with cross-entropy loss function and improved VGG16 network model. IEEE Access 2020, 8, 54564–54573.
12. Yu, J.; Kim, D.Y.; Lee, Y.; Jeon, M. Unsupervised pixel-level road defect detection via adversarial image-to-frequency transform. In Proceedings of the 2020 IEEE Intelligent Vehicles Symposium (IV), Las Vegas, NV, USA, 19 October–13 November 2020; IEEE: Piscataway, NJ, USA, 2020; pp. 1708–1713.
13. Wang, W.; Su, C. Deep learning-based real-time crack segmentation for pavement images. KSCE J. Civ. Eng. 2021, 25, 4495–4506.
14. Ramachandran, P.; Zoph, B.; Le, Q.V. Swish: A self-gated activation function. arXiv 2017, arXiv:1710.05941.
15. Shi, Y.; Cui, L.; Qi, Z.; Meng, F.; Chen, Z. Automatic road crack detection using random structured forests. IEEE Trans. Intell. Transp. Syst. 2016, 17, 3434–3445.
16. Chang, C.C.; Lin, C.J. LIBSVM: A library for support vector machines. ACM Trans. Intell. Syst. Technol. (TIST) 2011, 2, 1–27.
17. Long, J.; Shelhamer, E.; Darrell, T. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 3431–3440.
18. Zhang, L.; Yang, F.; Zhang, Y.D.; Zhu, Y.J. Road crack detection using deep convolutional neural network. In Proceedings of the 2016 IEEE International Conference on Image Processing (ICIP), Phoenix, AZ, USA, 25–28 September 2016; IEEE: Piscataway, NJ, USA, 2016; pp. 3708–3712.
19. Jenkins, M.D.; Carr, T.A.; Iglesias, M.I.; Buggy, T.; Morison, G. A deep convolutional neural network for semantic pixel-wise segmentation of road and pavement surface cracks. In Proceedings of the 2018 26th European Signal Processing Conference (EUSIPCO), Rome, Italy, 3–7 September 2018; IEEE: Piscataway, NJ, USA, 2018; pp. 2120–2124.
20. Nguyen, N.T.H.; Le, T.H.; Perry, S.; Nguyen, T.T. Pavement crack detection using convolutional neural network. In Proceedings of the 9th International Symposium on Information and Communication Technology, Da Nang City, Vietnam, 6–7 December 2018; pp. 251–256.
21. Zhang, H.; Wu, C.; Zhang, Z.; Zhu, Y.; Lin, H.; Zhang, Z.; Sun, Y.; He, T.; Mueller, J.; Manmatha, R.; et al. Resnest: Split-attention networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 2736–2746.
22. Li, J.; Liu, Y.; Zhang, Y.; Zhang, Y. Cascaded attention DenseUNet (CADUNet) for road extraction from very-high-resolution images. ISPRS Int. J. Geo-Inf. 2021, 10, 329.
23. Wang, Q.; Wu, B.; Zhu, P.; Li, P.; Zuo, W.; Hu, Q. ECA-Net: Efficient Channel Attention for Deep Convolutional Neural Networks. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; IEEE: Piscataway, NJ, USA, 2020.
24. Gou, S.; Tong, N.; Qi, S.; Yang, S.; Chin, R.; Sheng, K. Self-channel-and-spatial-attention neural network for automated multi-organ segmentation on head and neck CT images. Phys. Med. Biol. 2020, 65, 245034.
25. Yang, G.; Geng, P.; Ma, H.; Liu, J.; Luo, J. DWTA-Unet: Concrete Crack Segmentation Based on Discrete Wavelet Transform and Unet. In Proceedings of the 2021 Chinese Intelligent Automation Conference, Fuzhou, China, 16–17 October 2021; Springer: Singapore, 2022; pp. 702–710.
26. Temurtas, F.; Gulbag, A.; Yumusak, N. A study on neural networks using taylor series expansion of sigmoid activation function. In Proceedings of the Computational Science and Its Applications–ICCSA 2004: International Conference, Assisi, Italy, 14–17 May 2004; Springer: Berlin/Heidelberg, Germany, 2004; pp. 389–397.
27. Lau, M.M.; Lim, K.H. Review of adaptive activation function in deep neural network. In Proceedings of the 2018 IEEE-EMBS Conference on Biomedical Engineering and Sciences (IECBES), Sarawak, Malaysia, 3–6 December 2018; IEEE: Piscataway, NJ, USA, 2018; pp. 686–690.
28. Li, Y.; Yuan, Y. Convergence analysis of two-layer neural networks with relu activation. Adv. Neural Inf. Process. Syst. 2017, 30, 597–607.
29. Crnjanski, J.; Krstić, M.; Totović, A.; Pleros, N.; Gvozdić, D. Adaptive sigmoid-like and PReLU activation functions for all-optical perceptron. Opt. Lett. 2021, 46, 2003–2006.
30. Kato, S.; Hotta, K. Mse loss with outlying label for imbalanced classification. arXiv 2021, arXiv:2107.02393.
31. Mannor, S.; Peleg, D.; Rubinstein, R. The cross entropy method for classification. In Proceedings of the 22nd International Conference on Machine Learning, Bonn, Germany, 7–11 August 2005; pp. 561–568.
32. Phan, T.H.; Yamamoto, K. Resolving class imbalance in object detection with weighted cross entropy losses. arXiv 2020, arXiv:2006.01413.
33. Lin, T.Y.; Goyal, P.; Girshick, R.; He, K.; Dollár, P. Focal loss for dense object detection. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2980–2988.

Figure 1. Sample images from the OAD_CRACK dataset: (a) linear cracks; (b) circular cracks; (c) void cracks.

Figure 2. The structure of our proposed PSNet.

Figure 3. The structure of our proposed parallel-convolution-based U-Net (PSNet).

Figure 4. The structure of the self-gated attention block.

Figure 5. Comparison of the ReLU and Swish activation functions: (a) ReLU; (b) Swish.

Figure 6. Example crack detection results produced by our PSNet.

Table 1. Comparison with the state-of-the-art methods (accuracy per dataset).

Methods	Cracktree200	CRACK500	CFD	OAD_CRACK
CrackForest [15]	0.08	0.199	0.104	0.08
SVM [16]	0.382	0.418	0.32	0.215
FCN [17]	0.39	0.513	0.585	0.416
ConvNet [18]	0.471	0.591	0.579	0.422
U-Net by Jenkins [19]	0.75	0.681	0.851	0.681
U-Net by Nguyen [20]	0.763	0.695	0.856	0.683
Split-Attention Network [21]	0.851	0.73	0.963	0.696
Cascaded Attention DenseUNet [22]	0.863	0.74	0.97	0.695
ECA-Net [23]	0.885	0.753	0.971	0.711
DWTA-UNet [24]	0.90	0.77	0.973	0.729
PSNet [25]	0.926	0.812	0.985	0.762
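
The improvements quoted in the abstract (2.6%, 4.2%, 1.2%, and 3.3%) appear to correspond to the gap between the PSNet row and the DWTA-UNet row of Table 1. The short check below recomputes those gaps in percentage points; the values are copied from the table, and the variable names are illustrative only.

```python
# Accuracy values copied from Table 1, in the column order of the table header.
datasets = ["Cracktree200", "CRACK500", "CFD", "OAD_CRACK"]
dwta_unet = [0.90, 0.77, 0.973, 0.729]
psnet = [0.926, 0.812, 0.985, 0.762]

# Gap in percentage points, rounded to one decimal to absorb floating-point error.
margins = {
    name: round((ours - base) * 100, 1)
    for name, ours, base in zip(datasets, psnet, dwta_unet)
}

for name, gap in margins.items():
    print(f"{name}: +{gap} percentage points")
```

These are absolute differences between accuracy values, not relative improvements.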

Table 2. Effects of using different activation functions (accuracy per dataset).

Activation Functions	Cracktree200	CRACK500	CFD	OAD_CRACK
Sigmoid [26]	0.895	0.759	0.972	0.679
Tanh [27]	0.892	0.762	0.971	0.682
ReLU [28]	0.911	0.801	0.979	0.715
PReLU [29]	0.913	0.805	0.981	0.723
Swish [14]	0.926	0.812	0.985	0.762
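
For readers comparing the ReLU and Swish rows, the following is a minimal sketch of the two functions using their standard definitions. Swish is written as x · sigmoid(βx) following Ramachandran et al. [14]; the default β = 1 is an assumption here, since this section does not state the value used in PSNet.

```python
import math


def relu(x: float) -> float:
    """ReLU: passes positive inputs through and zeroes out negative ones."""
    return max(0.0, x)


def swish(x: float, beta: float = 1.0) -> float:
    """Swish: x * sigmoid(beta * x). beta=1.0 is an assumed default."""
    z = beta * x
    if z >= 0:
        return x / (1.0 + math.exp(-z))
    # For negative z, use the algebraically equivalent form to avoid overflow.
    ez = math.exp(z)
    return x * ez / (1.0 + ez)


for value in (-2.0, -0.5, 0.0, 0.5, 2.0):
    print(f"x={value:+.1f}  relu={relu(value):.4f}  swish={swish(value):.4f}")
```

Unlike ReLU, Swish is smooth and returns small negative values for negative inputs instead of a hard zero.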

Table 3. Effects of using different loss functions (accuracy per dataset).

Loss Functions	Cracktree200	CRACK500	CFD	OAD_CRACK
Mean Square Error Loss Function [30]	0.908	0.782	0.975	0.718
Cross Entropy Loss Function [31]	0.919	0.793	0.979	0.732
Weighted Cross Entropy Loss Function [32]	0.921	0.802	0.982	0.759
Focal Loss Function [33]	0.926	0.812	0.985	0.762
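
As a reminder of what the best-performing row computes, here is a minimal per-pixel sketch of the binary focal loss in the form given by Lin et al. [33], FL(p_t) = −α_t (1 − p_t)^γ log(p_t). The defaults γ = 2 and α = 0.25 are the values suggested in that reference, not settings confirmed for PSNet, and the function name and clamping constant are illustrative.

```python
import math


def focal_loss(p: float, y: int, gamma: float = 2.0, alpha: float = 0.25,
               eps: float = 1e-7) -> float:
    """Binary focal loss for one pixel.

    p is the predicted probability of the crack class, y is the label (1 = crack).
    gamma and alpha are assumed defaults taken from Lin et al., not from PSNet.
    """
    p = min(max(p, eps), 1.0 - eps)            # clamp to keep log() finite
    p_t = p if y == 1 else 1.0 - p             # probability assigned to the true class
    alpha_t = alpha if y == 1 else 1.0 - alpha
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


# A confidently correct pixel contributes far less than a badly misclassified one.
print(focal_loss(0.9, 1))
print(focal_loss(0.1, 1))
```

With gamma set to 0 the modulating factor disappears and the expression reduces to an alpha-weighted cross entropy, which links this row to the weighted cross entropy row above it.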

Table 4. Effects of using different numbers of blocks in the PCM (accuracy per dataset).

Number of Blocks	Cracktree200	CRACK500	CFD	OAD_CRACK
2	0.906	0.792	0.978	0.751
3	0.914	0.805	0.981	0.756
4	0.926	0.812	0.985	0.762
5	0.921	0.809	0.983	0.761

Table 5. Effects of using different attention mechanisms in the SGAB (accuracy per dataset).

Attention Mechanisms	Cracktree200	CRACK500	CFD	OAD_CRACK
Only Channel Attention in SGAB	0.893	0.799	0.973	0.743
Only Spatial Attention in SGAB	0.901	0.782	0.973	0.742
Only Channel Attention with Gate Convolution Filter in SGAB	0.902	0.80	0.975	0.753
Only Spatial Attention with Gate Convolution Filter in SGAB	0.906	0.791	0.976	0.759
Both Channel Attention and Spatial Attention with Gate Convolution Filter in SGAB	0.926	0.812	0.985	0.762

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

