Article

MA-Xnet: Mobile-Attention X-Network for Crack Detection

College of Information and Computer, Taiyuan University of Technology, Taiyuan 030024, China
*
Author to whom correspondence should be addressed.
Appl. Sci. 2022, 12(21), 11240; https://doi.org/10.3390/app122111240
Submission received: 16 October 2022 / Revised: 2 November 2022 / Accepted: 4 November 2022 / Published: 6 November 2022

Abstract

Modern crack-detection algorithms based on deep learning have unresolved issues, such as an abundance of parameters in the resulting models and a lack of context information. Such issues may lower the efficiency of feature extraction and degrade task performance. Based on two semantic segmentation models, U-Net and the dual-attention network (DANet), an efficient mobile-attention X-network (MA-Xnet) is proposed for crack detection. For performance evaluation, segmentation experiments were performed on concrete crack images from an internationally recognized dataset collected from various campus buildings of Middle East Technical University. The experimental results demonstrate that, compared with U-Net, the proposed method reduces the number of parameters by 82.33% and improves the key indices of the F1-Score and the mean intersection over union (mIoU) by 11.32% and 12.37%, respectively, providing a reference for subsequent lightweight crack-segmentation research.

1. Introduction

The timely monitoring of pavement cracks is essential for successfully maintaining road infrastructure. With ever-expanding urbanization and the development of expressways, the surface area of roads in service keeps increasing. An effective crack-detection method helps in establishing appropriate road maintenance and repair strategies and adhering to maintenance budgets. A crack-segmentation detection method that utilizes image processing and deep learning algorithms offers good real-time performance and accuracy in detecting road conditions. An image segmentation algorithm can be used to detect pavement cracks in digital images. Initial automated crack-detection methods include grey-scale threshold-based approaches [1,2] that calculate a threshold for the entire image based on the difference between the cracks and the background in order to classify the cracked pixels. However, the threshold-setting method has difficulty detecting intact cracks with complex topologies and is susceptible to noise pixels [3].
In addition to the above methods, edge-based detection methods [4] are used for crack detection, but these methods rely on the strong contrast of the crack image and tend to incorrectly identify noise, such as shadows and oil stains, as cracks [5].
At present, many of the most successful and advanced ideas in deep-learning-based semantic image segmentation models are derived from the fully convolutional network (FCN) model proposed by Long et al. [6] in 2015. FCNs are both a cornerstone and a milestone of deep learning in the field of semantic image segmentation. FCNs show how to train convolutional neural networks (CNNs) in an end-to-end manner for semantic image segmentation. CNNs have also been widely used in concrete or road detection methods in recent years [7,8,9,10,11]. FCNs are extensions of CNNs, and the final prediction result is a semantically segmented image. In 2019, Dung et al. [12] implemented the classification and segmentation of concrete cracks using an FCN as the backbone network. Yang et al. [13] combined FCNs with morphological operations to improve the efficiency of crack detection by eliminating the need to manually measure features in crack images.
U-Net [14] started as a deeply developed FCN in the field of medical image segmentation and has since become a benchmark for image segmentation in different fields. In particular, the application of the U-Net model to pavement crack detection has received wide attention due to a series of advantages, such as simplicity, efficiency, and ease of construction [15,16,17]. In concrete crack-detection experiments, Liu et al. [18] demonstrated that U-Net requires less training time and performs better than other FCN methods [12,13]. The U-Net model uses VGG-Net as the feature-extraction network. Although it can extract effective feature information, its abundant weight parameters make the model computationally heavy, so training the network takes a long time. In addition, the convolution method in U-Net extracts limited detail from features, and some fine features are easily lost when the features are recovered.
To integrate more semantic information without increasing the number of parameters, Chen et al. [19,20,21] proposed a series of deep CNN (DCNN) models called DeepLab. The core of these models is atrous convolution; i.e., expanding the receptive field of the convolution kernel by inserting holes into the convolution. However, the disadvantage of DeepLab is that it tends to cause gridding effects. Zhou et al. [22] improved the DCNN and applied it to the semantic segmentation of cracks without any pre-processing of the crack images. As such models become increasingly deep, their complexity also increases.
To improve the performance and reduce the number of parameters in these models, Google has proposed the lightweight CNN MobileNets series [23,24,25], focusing on the mobile use and deployment of CNNs. In 2020, Jing et al. [26] proposed an extremely efficient CNN, Mobile-U-Net, to achieve the segmentation of textile defects. Mobile-U-Net is based on the U-Net architecture, and the encoder is replaced by MobileNetV2. This method greatly improves the segmentation speed of the model and achieves high performance. To further improve segmentation accuracy and effectively utilize contextual semantic information, Fu et al. [27] proposed a dual-attention network (DANet), which added two types of attention modules to the traditional expanded FCN to adaptively integrate local features and global dependencies.
Most deep-learning-based crack segmentation methods [28,29,30,31,32] are improvements on the encoder-decoder-based U-Net. However, the deep layers and huge number of parameters of these methods greatly limit their application to crack detection, and they are prone to missed or false detections because feature information is not fully utilized during feature recovery. To address the shortcomings of current deep-learning-based crack segmentation methods, an improved mobile-attention semantic segmentation network, MA-Xnet, is proposed, based on the semantic segmentation models U-Net and DANet. The effectiveness of our algorithm is verified via comparison experiments on a public dataset. The innovations of the proposed network are set out below.
(1)
U-Net increases the segmentation accuracy by fusing convolutional-layer features with deconvolution-layer features through skip connections during upsampling, but these fused features depend on the features learned during downsampling. To reduce this dependence and thus further improve the segmentation accuracy, a new skip-connection method, the cross-skip connection, is proposed. In this way, an X-shaped dual-input, dual-output backbone network, X-Net, is constructed. X-Net uses two parallel U-Net models for feature learning, and the skip connections from the two U-shaped encoders to their decoders are changed to cross-skip connections, so that the feature maps of each encoder are skip-connected to the decoder of the other. This enables the different feature information extracted by the two encoders to interact in the decoder parts and improves the utilization of crack-feature information.
(2)
A position attention module (PA) and a channel attention module (CA) are used in the penultimate layer of the two decoder sections of the model to capture global contextual information in two different dimensions, the spatial dimension and the channel dimension, respectively. They focus on enhancing the semantic features of the cracks in order to integrate the semantic information of deep and shallow layers and, thus, enhance the feature representation.
(3)
To address the problem of oversized network parameters, we used the 17-layer bottleneck residual depth-wise separable convolutional block of MobileNetV2 as the feature extraction part of the model to make the network lightweight.

2. Proposed Methodology

2.1. Proposed MA-Xnet

MA-Xnet is proposed for crack segmentation; it aims to reduce the number of parameters and utilize feature information effectively. The MA-Xnet network structure is shown in Figure 1. The encoder part uses MobileNetV2 [24] for feature extraction, and Table 1 shows the overall architecture of MobileNetV2, where t is the expansion factor, n is the number of repeated blocks, s indicates the stride, and c is the number of output channels after the last convolution operation. MobileNetV2 uses depth-wise separable convolution to replace the standard convolution of U-Net to reduce the parameters of the model, and the bottleneck residual structure reduces the information loss caused by image compression, which gives better robustness, especially at low image resolutions.
Meanwhile, to effectively utilize more contextual semantic feature information about cracks, a backbone network with X-Net as its backbone is constructed. The cross-skip connections allow the different feature information extracted by the two encoders to interact in the decoder parts, which reduces the dependence of the deconvolution layers on the features extracted by the convolution layers in U-Net's skip-connection fusion approach and further improves the utilization of crack-feature information during upsampling. Then, the position attention module (PA) and the channel attention module (CA) are applied in the penultimate layer of the two decoder sections to capture global contextual information in two different dimensions, the spatial dimension and the channel dimension, respectively. They focus on enhancing the semantic features of the cracks to integrate the semantic information of deep and shallow layers and, thus, enhance the feature representation. Finally, the two sets of semantic features produced by the PA module and the CA module are summed, and the resulting segmentation map, with a better pixel-level feature representation, is output.
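To make the cross-skip wiring concrete, the following is a minimal Keras sketch of the X-shaped dual encoder-decoder structure, not the authors' implementation: the MobileNetV2 encoders and the PA/CA modules are stood in for by small placeholder blocks, both branches are assumed to receive the same input image, and the layer counts and channel widths are illustrative.

```python
# Minimal structural sketch of the X-shaped dual encoder-decoder with
# cross-skip connections. NOT the authors' implementation: the MobileNetV2
# encoders and the PA/CA modules are replaced by small placeholder blocks.
import tensorflow as tf
from tensorflow.keras import layers, Model

def encoder_block(x, filters):
    # Placeholder encoder stage: convolution, then 2x downsampling.
    skip = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
    return skip, layers.MaxPooling2D(2)(skip)

def decoder_block(x, cross_skip, filters):
    # Upsample and fuse with the skip feature coming from the *other* encoder.
    x = layers.Conv2DTranspose(filters, 2, strides=2, padding="same")(x)
    x = layers.Concatenate()([x, cross_skip])
    return layers.Conv2D(filters, 3, padding="same", activation="relu")(x)

def build_xnet_sketch(input_shape=(224, 224, 3), widths=(16, 32, 64)):
    inp = layers.Input(input_shape)

    # Two parallel encoders (placeholders for the MobileNetV2 backbones).
    skips_a, skips_b, xa, xb = [], [], inp, inp
    for f in widths:
        sa, xa = encoder_block(xa, f)
        sb, xb = encoder_block(xb, f)
        skips_a.append(sa)
        skips_b.append(sb)

    # Two decoders; each consumes the skip features of the opposite encoder
    # (the cross-skip connection).
    da, db = xa, xb
    for f, sa, sb in zip(reversed(widths), reversed(skips_a), reversed(skips_b)):
        da = decoder_block(da, sb, f)  # decoder A <- encoder B skips
        db = decoder_block(db, sa, f)  # decoder B <- encoder A skips

    # Penultimate-layer attention: placeholders standing in for PA and CA.
    da = layers.Conv2D(widths[0], 3, padding="same", activation="relu", name="pa_placeholder")(da)
    db = layers.Conv2D(widths[0], 3, padding="same", activation="relu", name="ca_placeholder")(db)

    # Element-wise sum of the two branches, then a 1x1 sigmoid head for the crack mask.
    out = layers.Conv2D(1, 1, activation="sigmoid")(layers.Add()([da, db]))
    return Model(inp, out)

model = build_xnet_sketch()
model.summary()
```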

2.2. Depth-Wise Separable Convolutions

Standard convolutional neural networks use large convolution kernels to obtain larger receptive fields. However, this approach greatly increases the number of model parameters. Compared with standard convolution, depth-wise separable convolution requires fewer parameters to be tuned, reduces possible overfitting, and enables good real-time performance of the network thanks to the reduced computation. Figure 2a represents the standard convolution, and Figure 2b represents the depth-wise separable convolution. The core idea is to split a standard convolution into two independent convolutional layers. The first layer, called a depth-wise convolution, performs lightweight filtering by applying a single convolution filter per input channel. The second layer is a 1 × 1 convolution, called a point-wise convolution, which builds new features by computing linear combinations of the input channels. The input tensor of the standard convolutional layer has size:
$L_i = h_i \times w_i \times d_i$  (1)
where $h_i$ denotes the height of the image, $w_i$ denotes the width of the image, and $d_i$ denotes the number of channels of the image. After filtering with convolution kernels of size $k \times k$ to produce $d_j$ output channels, the computational cost of the standard convolution is:
$\mathrm{Cost} = h_i \cdot w_i \cdot d_i \cdot d_j \cdot k \cdot k$  (2)
The cost with depth-wise separable convolution is only:
$\mathrm{Cost} = h_i \cdot w_i \cdot d_i \cdot \left(k^2 + d_j\right)$  (3)
Equation (3) shows the total computational cost of the depth-wise separable convolution operation. The ratio between Equations (3) and (2) is given in Equation (4):
$\mathrm{Diff\_Ratio} = \frac{1}{d_j} + \frac{1}{k^2}$  (4)
Diff_Ratio indicates how much cheaper the depth-wise separable convolution is than the standard convolution. Since $d_j$ is usually much larger than $k^2$, this ratio is approximately $1/k^2$; i.e., replacing the standard convolution with the depth-wise separable convolution reduces the computational effort of the model by a factor of roughly $k^2$.
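As an illustration of the cost argument above, the following Keras snippet (not the authors' code) builds a standard 3 × 3 convolution and its depth-wise separable counterpart on a hypothetical 32-channel input and prints their parameter counts, showing the roughly k²-fold reduction.

```python
# Standard vs. depth-wise separable convolution: a small parameter-count demo.
# The 3x3 kernel and channel sizes (32 in, 64 out) are illustrative assumptions.
import tensorflow as tf
from tensorflow.keras import layers, Model

inp = layers.Input((224, 224, 32))

# Standard convolution: k x k x d_i x d_j weights.
std = layers.Conv2D(64, kernel_size=3, padding="same", use_bias=False)(inp)

# Depth-wise separable convolution: one k x k filter per input channel,
# followed by a 1 x 1 point-wise convolution that mixes channels.
dw = layers.DepthwiseConv2D(kernel_size=3, padding="same", use_bias=False)(inp)
pw = layers.Conv2D(64, kernel_size=1, padding="same", use_bias=False)(dw)

print(Model(inp, std).count_params())  # 3*3*32*64 = 18,432
print(Model(inp, pw).count_params())   # 3*3*32 + 1*1*32*64 = 2,336
```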

2.3. Bottleneck Residual Block

In the convolutional layers used for feature extraction, the amount of multiplication computation depends on the dimensionality of the feature-map matrix: the lower the dimensionality of the feature map, the faster the network computes. However, the dimensionality of the feature map has already been reduced before feature extraction, so we first use the bottleneck residual block to expand the dimensionality of the feature-map matrix. As shown in Figure 3, the bottleneck residual block is divided into a direct-mapping part and a residual part, and the unmodified early activations in the convolutional block can be accessed directly through skip connections, an approach that is crucial for building deep networks. For a low-dimensional compressed feature, the module first uses an “Expand” layer of 1 × 1 convolution to map the low-dimensional space to a high-dimensional space; then, it uses a depth-wise separable convolution to fully extract the crack features while significantly reducing the network parameters. Finally, the linear convolution of the “Compress” layer maps the high-dimensional space back to the low-dimensional space. This design expands the receptive field of the network and balances operational efficiency against feature extraction.
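The following is a minimal Keras sketch of a MobileNetV2-style bottleneck residual block, assuming the usual expand, depth-wise, and linear-compress stages described above; the expansion factor, stride, and channel counts are illustrative rather than the authors' exact configuration.

```python
# Sketch of a bottleneck (inverted) residual block with a direct-mapping path.
import tensorflow as tf
from tensorflow.keras import layers

def bottleneck_residual(x, out_channels, t=6, stride=1):
    in_channels = x.shape[-1]  # assumes a statically known channel count

    # "Expand" layer: 1x1 convolution maps the low-dimensional input to
    # t times as many channels.
    h = layers.Conv2D(t * in_channels, 1, padding="same", use_bias=False)(x)
    h = layers.BatchNormalization()(h)
    h = layers.ReLU(max_value=6.0)(h)

    # Depth-wise 3x3 convolution extracts spatial (crack) features per channel.
    h = layers.DepthwiseConv2D(3, strides=stride, padding="same", use_bias=False)(h)
    h = layers.BatchNormalization()(h)
    h = layers.ReLU(max_value=6.0)(h)

    # "Compress" layer: linear 1x1 convolution back to a low-dimensional space
    # (no activation, to avoid destroying information in the bottleneck).
    h = layers.Conv2D(out_channels, 1, padding="same", use_bias=False)(h)
    h = layers.BatchNormalization()(h)

    # Direct-mapping (skip) path, usable when the shapes match.
    if stride == 1 and in_channels == out_channels:
        h = layers.Add()([x, h])
    return h
```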

2.4. PA Module

The PA module is designed to focus the network's attention on the crack regions of the image that contribute to the task. Figure 4 shows the PA module diagram. Given a local feature A ∈ R^(C×H×W), we first feed it into two parallel convolutional layers to generate two new feature maps, V and Q, where {V, Q} ∈ R^(C×H×W). The feature V′ generated by a pooling operation on V is cut into K features M ∈ R^(P×H′×W′) and reshaped into R^(P×N′) (N′ = H′ × W′). The feature Q is sliced into K features Z ∈ R^(P×H×W) (C = P × K) and reshaped into R^(P×N) (N = W × H). After that, we perform matrix multiplication between the transpose of M and Z and apply a softmax layer to compute the spatial attention map S ∈ R^(N′×N):
$S_{ji} = \frac{\exp\left(M_i^{T} Z_j\right)}{\sum_{i=1}^{N}\exp\left(M_i^{T} Z_j\right)}$  (5)
where $S_{ji}$ measures the effect of the ith position on the jth position. The more similar the feature representations of two positions are, the greater the correlation between them. After computing the spatial attention map, we perform matrix multiplication between the transpose of M and S and reshape the result into R^(P×H×W). The results of the K groups of features are then combined into L ∈ R^(C×H×W):
$L_j = \sum_{i=1}^{N} S_{ji} M_i$  (6)
Finally, we multiply it by a scale parameter α and perform an element-wise summation with the feature A to obtain the final output G ∈ R^(C×H×W):
$G_j = \alpha L_j + A_j$  (7)
where α is initialized to 0 and gradually learns to assign more weights. Equation (7) shows that the resulting feature G for each location is a weighted sum of all location features and original features. Thus, it utilizes the spatial attention network to selectively gather contextual semantic data and has a global contextual view.
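The following is a simplified sketch of a DANet-style position attention layer in Keras. For brevity it is assumed that no pooling or grouping into K feature groups is applied, so the attention map is computed over all H × W positions; the layer and variable names are illustrative, not the authors' code.

```python
import tensorflow as tf
from tensorflow.keras import layers

class PositionAttention(layers.Layer):
    """Simplified position (spatial) attention; `channels` must equal the
    number of input channels so that the residual addition is valid."""

    def __init__(self, channels, **kwargs):
        super().__init__(**kwargs)
        self.channels = channels
        self.conv_v = layers.Conv2D(channels, 1)
        self.conv_q = layers.Conv2D(channels, 1)
        # Scale parameter alpha, initialized to 0 and learned during training.
        self.alpha = self.add_weight(name="alpha", shape=(), initializer="zeros")

    def call(self, a):
        h, w = a.shape[1], a.shape[2]  # assumes static spatial dimensions
        v = tf.reshape(self.conv_v(a), (-1, h * w, self.channels))  # (B, N, C)
        q = tf.reshape(self.conv_q(a), (-1, h * w, self.channels))  # (B, N, C)

        # Spatial attention map S: pairwise similarity between all N positions.
        s = tf.nn.softmax(tf.matmul(v, q, transpose_b=True), axis=-1)  # (B, N, N)

        # Weighted sum of position features (cf. Equation (6)), back to a map.
        l = tf.reshape(tf.matmul(s, v), (-1, h, w, self.channels))

        # Residual fusion with the input feature A (cf. Equation (7)).
        return self.alpha * l + a
```

In MA-Xnet, a layer of this kind would sit at the penultimate layer of one decoder branch, with the channel attention layer of Section 2.5 at the same depth of the other branch.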

2.5. CA Module

The core idea of the channel attention module is inspired by SENet [33]. It is mainly used to model the correlation between feature channels to improve the accuracy of the network. By exploiting the interdependencies between channel maps, we can emphasize interdependent feature maps and improve the feature representations of specific semantics. Figure 5 shows the CA module diagram. The given feature A ∈ R^(C×H×W) is divided into K groups of features E ∈ R^(L×H′×W′) after a pooling operation. The features are then reshaped into E′ ∈ R^(L×N′), and matrix multiplication is performed between E′ and its transpose. Finally, a softmax layer is used to obtain the channel attention map X ∈ R^(L×L):
$x_{ji} = \frac{\exp\left(E_i \cdot E_j\right)}{\sum_{i=1}^{L}\exp\left(E_i \cdot E_j\right)}$  (8)
where $x_{ji}$ measures the effect of the ith channel on the jth channel. In addition, A is split into K groups of features and reshaped into D ∈ R^(L×N). Next, matrix multiplication is performed between the transpose of X and D, and the result is reshaped as F ∈ R^(L×H×W). Then, the K sets of features are combined to generate:
$F_j = \sum_{i=1}^{L} X_{ji} D_i$  (9)
We multiply the above result by the parameter β and perform an element-wise summation with A to obtain the final output H ∈ R^(C×H×W):
$H_j = \beta F_j + A_j$  (10)
where β gradually learns the weights, starting at 0. Equation (10) shows that the final features of each channel are the weighted sum of the features of all channels and the original features, and it helps to improve the discriminability of the crack features.
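Analogously, the following is a simplified sketch of a channel attention layer, again assuming (for brevity) no pooling or grouping into K channel groups, so the attention map is computed directly over all C channels.

```python
import tensorflow as tf
from tensorflow.keras import layers

class ChannelAttention(layers.Layer):
    """Simplified channel attention computed directly over all C channels."""

    def __init__(self, **kwargs):
        super().__init__(**kwargs)
        # Scale parameter beta, initialized to 0 and learned during training.
        self.beta = self.add_weight(name="beta", shape=(), initializer="zeros")

    def call(self, a):
        h, w, c = a.shape[1], a.shape[2], a.shape[3]  # assumes static shapes
        d = tf.reshape(a, (-1, h * w, c))                              # (B, N, C)

        # Channel attention map X: pairwise similarity between all C channels.
        x = tf.nn.softmax(tf.matmul(d, d, transpose_a=True), axis=-1)  # (B, C, C)

        # Re-weight the channels of every position (cf. Equation (9)).
        f = tf.reshape(tf.matmul(d, x, transpose_b=True), (-1, h, w, c))

        # Residual fusion with the input feature A (cf. Equation (10)).
        return self.beta * f + a
```

The outputs of the two attention branches are then summed element-wise, as described in Section 2.1, to produce the final segmentation features.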

3. Experiment Setup

3.1. Working Environment

The hardware used in our experiments was as follows: the CPU is an Intel i7-9700, and the GPU is an NVIDIA GeForce RTX 2080 with 16 GB of RAM. The software includes the Windows 10 operating system and the Keras deep learning framework.

3.2. Experimental Details

Crack segmentation is a binary semantic segmentation problem, and the network in this paper uses a binary cross-entropy loss function, defined as follows:
$L = -\frac{1}{S}\sum_{j=1}^{S}\left[y_j \log p_j + \left(1 - y_j\right)\log\left(1 - p_j\right)\right]$  (11)
where S is the number of image pixels; yj is the label value of the jth pixel point, and pj is the predicted probability value of the jth pixel point. The smaller the value of the binary cross-entropy loss function, the closer the distribution of the two functions, and the closer the predicted value is to the true value. The activation function used is the nonlinear activation function ReLU6, which is a variant of the ReLU function that limits the maximum output to 6 and is more robust at low precision. Its expression is as follows:
$\mathrm{ReLU6}(x) = \begin{cases} 0, & x \le 0 \\ x, & 0 < x < 6 \\ 6, & x \ge 6 \end{cases}$  (12)
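For reference, both functions are available off the shelf in TensorFlow/Keras; the short snippet below (with illustrative values) shows ReLU6 clipping activations to [0, 6] and the binary cross-entropy loss comparing per-pixel crack probabilities with a binary mask.

```python
import tensorflow as tf

x = tf.constant([-2.0, 3.0, 8.0])
print(tf.nn.relu6(x))  # [0., 3., 6.]

bce = tf.keras.losses.BinaryCrossentropy()
y_true = tf.constant([[1.0, 0.0, 1.0, 0.0]])   # ground-truth crack / background pixels
y_pred = tf.constant([[0.9, 0.1, 0.8, 0.2]])   # predicted crack probabilities
print(bce(y_true, y_pred))  # mean of -[y log p + (1 - y) log(1 - p)]
```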
To quantitatively evaluate the performance of the algorithm, we use Precision (P), Recall (R), the harmonic mean F1 score of precision and recall (F-Measure, F), and the mean intersection over union (mIoU) to quantitatively analyze the experimental results. These metrics are defined as follows:
$P = \frac{TP}{TP + FP}$  (13)
$R = \frac{TP}{TP + FN}$  (14)
$F = \frac{\left(\alpha^{2} + 1\right) \times P \times R}{\alpha^{2} \times P + R}$  (15)
$\mathrm{mIoU} = \frac{1}{k+1}\sum_{i=0}^{k}\frac{TP}{TP + FP + FN}$  (16)
where TP is the number of pixels correctly identified as cracks, FP is the number of pixels wrongly identified as cracks, TN is the number of pixels correctly classified as background, and FN is the number of pixels wrongly classified as background. F1 integrates Recall and Precision; when α in Equation (15) is 1, the weighted harmonic mean F can fully reflect the performance of the algorithm.
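As a small worked example of these metrics, the following sketch computes P, R, F1, and mIoU from hypothetical per-image confusion-matrix counts; with one crack class plus background (k = 1), the mIoU averages the IoU of the two classes.

```python
# Metric computation from confusion-matrix counts (illustrative values only).
def precision(tp, fp):
    return tp / (tp + fp)

def recall(tp, fn):
    return tp / (tp + fn)

def f_measure(p, r, alpha=1.0):
    # Weighted harmonic mean of precision and recall; alpha = 1 gives the F1-Score.
    return (alpha**2 + 1) * p * r / (alpha**2 * p + r)

def iou(tp, fp, fn):
    return tp / (tp + fp + fn)

# Hypothetical pixel counts for the crack class of one image.
tp, fp, fn, tn = 900, 100, 150, 48850

p, r = precision(tp, fp), recall(tp, fn)
crack_iou = iou(tp, fp, fn)
background_iou = iou(tn, fn, fp)          # FP and FN swap roles for the background class
miou = (crack_iou + background_iou) / 2   # mean over the k + 1 = 2 classes

print(f"P={p:.3f}  R={r:.3f}  F1={f_measure(p, r):.3f}  mIoU={miou:.3f}")
```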

3.3. Experimental Datasets and Parameter Setting

To confirm the value of the approach presented in this paper, segmentation experiments were performed on the publicly available dataset of concrete crack images collected from various campus buildings of Middle East Technical University (MDTU) [34]. The dataset was built using the method proposed by Zhang et al. [35] and contains 20,000 positive samples with cracks and 20,000 negative samples without cracks. For our experiments, 1000 crack images and 1000 non-crack images were randomly selected from the 40,000 images in the dataset and randomly split into training and test sets at a ratio of 4:1. We used the Adam optimizer with an initial learning rate of 0.001, a training batch size of 10, and 50 training epochs.
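The training configuration described above corresponds to a Keras setup along the following lines; this is a hedged sketch rather than the authors' script, and the model builder and placeholder arrays are assumptions.

```python
# Training-configuration sketch: Adam, learning rate 0.001, batch size 10, 50 epochs.
import numpy as np
import tensorflow as tf

model = build_xnet_sketch()  # e.g., the structural sketch shown in Section 2.1

model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
    loss="binary_crossentropy",   # Equation (11)
    metrics=["accuracy"],
)

# Placeholder tensors standing in for the crack images and their binary masks.
train_images = np.zeros((10, 224, 224, 3), dtype="float32")
train_masks = np.zeros((10, 224, 224, 1), dtype="float32")

model.fit(train_images, train_masks, batch_size=10, epochs=50)
```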

4. Experiment Study

4.1. Experimental Results on MDTU

The proposed method in this paper is an improvement of the U-Net-based network structure, and the encoder part uses MobileNet for feature extraction, which greatly reduces the number of parameters in the network model. To focus on more semantic information about cracks, we use cross-skip connection and, furthermore, add PA and CA modules in the decoder part to enhance the feature representation of cracks in both spatial and channel dimensions. Finally, the two parts are summed and fused to produce a high segmentation accuracy result. Therefore, U-Net, MobileNetV2-UNet (which also uses MobileNet as the encoder for feature extraction), and DANet (which uses the dual-attention mechanism in parallel), are chosen as experimental comparison models with MA-Xnet to qualitatively verify the performance of the proposed model. The pixel accuracy curves and loss function curves of different methods are shown in Figure 6 and Figure 7.
It can be seen in Figure 6 that the growth of each model stabilizes at around the 40th epoch. Compared with U-Net, DANet, and MobileNetV2-UNet, the pixel accuracy curve of the proposed model is the highest, indicating that its performance is effectively improved compared with that of the baseline network U-Net. Meanwhile, the loss function curves in Figure 7 show that MA-Xnet and the lightweight network MobileNetV2-UNet converge faster than DANet and U-Net, which have larger parameter counts and deeper layers, owing to the use of the lightweight MobileNet as the feature extractor. Although the convergence speed of MA-Xnet and MobileNetV2-UNet is similar, MA-Xnet makes more effective use of the crack feature information due to the cross-skip connections and the inclusion of the dual-attention module, which places it above MobileNetV2-UNet on the accuracy curve. On the other hand, Recall, Precision, the F1-Score, and the mean intersection over union (mIoU) are used to quantitatively analyze the performance of the different networks, with the metric comparison shown in Table 2.
As shown in Table 2, this paper's method outperforms the compared methods in several metrics, especially the key metrics of the F1-Score and the mIoU. Because U-Net extracts shallow feature information and does not effectively consider the semantic features of the channel and spatial dimensions, its segmentation results are coarser. Compared with U-Net, the proposed model improves the key indices of the F1-Score and the mIoU by 11.32% and 12.37%, respectively. The number of parameters of the proposed model is reduced by 82.33% compared with U-Net before the improvement. This is because the bottleneck residual depth-wise separable convolution allows the proposed model to effectively avoid the problems of gradient explosion and overfitting caused by the deeper layers of the network, thereby improving the operational efficiency of the network. Compared with MobileNetV2-UNet, a lightweight neural network that also applies depth-wise separable convolution, the cross-skip connections and dual-attention mechanism of the proposed model increase the weight of useful feature information in the spatial and channel dimensions of crack images, and thus improve on MobileNetV2-UNet in the F1-Score and the mIoU by 5.31% and 8.06%, respectively. Compared with DANet, which also applies the dual-attention mechanism, the feature extractor and cross-skip connections of the proposed model appear to be more advantageous, with improvements of 4.34% and 6.05% in the key indices of the F1-Score and the mIoU, respectively. In terms of detection time, although the use of depth-wise separable convolution greatly reduces the number of parameters, the inference time is slightly higher than that of U-Net, due to the application of the dual-attention module. Figure 8 and Figure 9 provide segmentation results for the different models to show their segmentation performance more visually.
As can be seen from Figure 8, for the first, fifth, and sixth (finer) cracks, U-Net, MobileNetV2-UNet, and DANet show different degrees of discontinuity, while the proposed model does not show significant discontinuities or missed detections. For the second, third, and fourth cracks, which lie next to dot-shaped cracks, the proposed model is also the most accurate in terms of the location and size of the segmented cracks, compared with the other networks. Figure 9 shows segmentation maps obtained by zooming in on a detailed feature of Figure 8 to show each model's segmentation of fine details more clearly. The first column is the ground truth and the second column is the enlarged ground truth, where the enlarged portion is marked with a blue box in the first column. It can be seen from Figure 9 that, compared with the other models, the proposed model recovers the detailed features of cracks best, and its segmentation results are clearer and closest to the ground truth, proving that the proposed model makes more effective use of the crack feature information.

4.2. Ablation Experiments

4.2.1. Validation of the Validity of Different Modules

Ablation experiments were performed on the modules to further verify that the proposed improvements increase the model's accuracy in segmenting cracks. The model contains a bottleneck residual depth-wise separable convolution block for feature extraction, a PA module, and a CA module. The compared models include U-Net, M-UNet (a lightweight network using the bottleneck residual depth-wise separable convolution as the feature extractor on U-Net), PAM-UNet (with only the PA module added to M-UNet), CAM-UNet (with only the CA module added to M-UNet), and MA-Xnet (with both dual-attention modules, PA and CA, added to M-UNet). The tests were conducted in the same experimental environment, and the experimental results are shown in Table 3.
The experimental data in Table 3 demonstrate that feature extraction using the bottleneck residual depth-wise separable convolution, which reduces the network parameters while preserving the way the network extracts features, improves on the baseline network U-Net by 4.38% and 4.97% in the key metrics of the F1-Score and the mIoU, respectively. Adding the position attention module enables the network to capture the mapping of local and global dependencies of crack features, integrating remote contextual information in the spatial dimension of crack pixels and improving on the lightweight network M-UNet by 4.32% and 3.73% in the key metrics of the F1-Score and the mIoU, respectively. Adding the channel attention module allows the model to selectively enhance the weight of channels that carry beneficial crack information and suppress the weight of useless features, improving the F1-Score and the mIoU by 1.89% and 2.28%, respectively, compared with the lightweight network M-UNet. With the two attention modules finally incorporated in parallel, the F1-Score reaches 90.53% and the mIoU reaches 81.32%, substantial improvements in the key indices, proving that the proposed model effectively improves the segmentation accuracy of cracks.

4.2.2. Verification of the Validity of Cross-Skip Connections

A comparison test between the cross-skip connection and the standard skip connection was conducted to verify the advantages of the cross-skip connection in the X-Net backbone of the proposed model. Figure 10a shows the structure of the cross-skip connection X-Net, while Figure 10b shows the structure of the standard skip connection. The results are shown in Figure 11, from which it can be seen that the cross-skip connection outperformed the standard skip connection in the key metrics of the F1-Score and the mIoU. The F1-Score and the mIoU of the cross-skip connection were 3.37% and 2.2% higher than those of the standard skip connection, respectively, which demonstrates that the cross-skip connection allows the different feature information extracted by the two encoders to interact in the decoder parts, thereby improving the utilization of the crack feature information.

5. Conclusions

In this paper, an improved crack segmentation neural network, MA-Xnet, was proposed. The encoder part uses bottleneck residual depth-wise separable convolutional blocks for feature extraction to reduce the number of parameters in the network, while a novel cross-skip connection is combined with a dual-attention mechanism to integrate additional crack detail feature information. The experimental results demonstrate that the enhanced network model proposed in this research performs well on the publicly available dataset with only about one-sixth of the parameters of the original model, achieving 90.53% and 81.32% in the key metrics of the F1-Score and the mean intersection over union (mIoU), respectively. The proposed MA-Xnet will subsequently be optimized so that it can be applied to real-time road-crack image segmentation on small mobile devices.

Author Contributions

Data curation, Y.W.; Funding acquisition, R.C.; Investigation, C.Y.; Methodology, Y.W.; Resources, X.W. and Y.G.; Software, C.W.; Validation, J.W. and R.C.; Writing—original draft, Y.W.; Writing—review & editing, Y.W. All authors have read and agreed to the published version of the manuscript.

Funding

Financial support for this research was provided by the National Natural Science Foundation of China (Project Number: 61873178) and the Natural Science Foundation of Shanxi (Project Number: 201901D111093).

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Banharnsakun, A. Hybrid ABC-ANN for pavement surface distress detection and classification. Int. J. Mach. Learn. Cybern. 2015, 8, 699–710. [Google Scholar] [CrossRef]
  2. Oliveira, H.; Correia, P.L. Automatic Road Crack Segmentation Using Entropy And Image Dynamic Thresholding. In Proceedings of the 2009 17th European Signal Processing Conference, Glasgow, UK, 24–28 August 2009; pp. 622–626. [Google Scholar] [CrossRef]
  3. Peng, C.; Yang, M.; Zheng, Q.; Zhang, J.; Wang, D.; Yan, R.; Wang, J.; Li, B. A triple-thresholds pavement crack detection method leveraging random structured forest. Constr. Build. Mater. 2020, 263, 120080. [Google Scholar] [CrossRef]
  4. Nhat-Duc, H.; Nguyen, Q.-L.; Tran, V.-D. Automatic recognition of asphalt pavement cracks using metaheuristic optimized edge detection algorithms and convolution neural network. Autom. Constr. 2018, 94, 203–213. [Google Scholar] [CrossRef]
  5. Lau, S.L.H.; Chong, E.K.P.; Yang, X.; Wang, X. Automated Pavement Crack Segmentation Using U-Net-Based Convolutional Neural Network. IEEE Access 2020, 8, 114892–114899. [Google Scholar] [CrossRef]
  6. Long, J.; Shelhamer, E.; Darrell, T. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015. [Google Scholar]
  7. Escalona, U.; Arce, F.; Zamora, E.; Azuela, J.H.S. Fully Convolutional Networks for Automatic Pavement Crack Segmentation. Comput. Sist. 2019, 23, 451–460. [Google Scholar] [CrossRef]
  8. Li, S.; Zhao, X.; Zhou, G. Automatic pixel-level multiple damage detection of concrete structure using fully convolutional network. Comput. Aided Civil Infrastruct. Eng. 2019, 34, 616–634. [Google Scholar] [CrossRef]
  9. Liu, Y.; Yao, J.; Lu, X.; Xie, R.; Li, L. DeepCrack: A deep hierarchical feature learning architecture for crack segmentation. Neurocomputing 2019, 338, 139–153. [Google Scholar] [CrossRef]
  10. Zhang, J.; Lu, C.; Wang, J.; Wang, L.; Yue, X.-G. Concrete Cracks Detection Based on FCN with Dilated Convolution. Appl. Sci. 2019, 9, 2686. [Google Scholar] [CrossRef]
  11. Islam, M.; Hossain, B.; Akhtar, N.; Moni, M.A.; Hasan, K.F. CNN Based on Transfer Learning Models Using Data Augmentation and Transformation for Detection of Concrete Crack. Algorithms 2022, 15, 287. [Google Scholar] [CrossRef]
  12. Dung, C.V.; Anh, L.D. Autonomous concrete crack detection using deep fully convolutional neural network. Autom. Constr. 2018, 99, 52–58. [Google Scholar] [CrossRef]
  13. Yang, X.; Li, H.; Yu, Y.; Luo, X.; Huang, T.; Yang, X. Automatic Pixel-Level Crack Detection and Measurement Using Fully Convolutional Network. Comput.-Aided Civil Infrastruct. Eng. 2018, 33, 1090–1109. [Google Scholar] [CrossRef]
  14. Ronneberger, O.; Fischer, P.; Brox, T. U-net: Convolutional networks for biomedical image segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention; Springer: Cham, Switzerland, 2015. [Google Scholar]
  15. Konig, J.; Jenkins, M.D.; Barrie, P.; Mannion, M.; Morison, G. A Convolutional Neural Network for Pavement Surface Crack Segmentation Using Residual Connections and Attention Gating. In Proceedings of the 2019 IEEE International Conference on Image Processing (ICIP), Taipei, Taiwan, 22–25 September 2019; pp. 1460–1464. [Google Scholar] [CrossRef]
  16. Jenkins, M.D.; Carr, T.A.; Iglesias, M.I.; Buggy, T.; Morison, G. A Deep Convolutional Neural Network for Semantic Pixel-Wise Segmentation of Road and Pavement Surface Cracks. In Proceedings of the 2018 26th European Signal Processing Conference (EUSIPCO), Rome, Italy, 3–7 September 2018; pp. 2120–2124. [Google Scholar] [CrossRef]
  17. Zhang, L.; Shen, J.; Zhu, B. A research on an improved Unet-based concrete crack detection algorithm. Struct. Health Monit. 2020, 20, 1864–1879. [Google Scholar] [CrossRef]
  18. Liu, Z.; Cao, Y.; Wang, Y.; Wang, W. Computer vision-based concrete crack detection using U-net fully convolutional networks. Autom. Constr. 2019, 104, 129–139. [Google Scholar] [CrossRef]
  19. Chen, L.C.; Papandreou, G.; Kokkinos, I.; Murphy, K.; Yuille, A.L. DeepLab: Semantic Image Segmentation with Deep Convolutional Nets, Atrous Convolution, and Fully Connected CRFs. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 40, 834–848. [Google Scholar] [CrossRef] [PubMed]
  20. Chen, L.C.; Papandreou, G.; Schroff, F.; Adam, H. Rethinking atrous convolution for semantic image segmentation. arXiv 2017, arXiv:1706.05587. [Google Scholar]
  21. Chen, L.C.; Zhu, Y.; Papandreou, G.; Schroff, F.; Adam, H. Encoder-decoder with atrous separable convolution for semantic image segmentation. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018. [Google Scholar]
  22. Zhou, S.; Song, W. Concrete roadway crack segmentation using encoder-decoder networks with range images. Autom. Constr. 2020, 120, 103403. [Google Scholar] [CrossRef]
  23. Howard, A.G.; Zhu, M.; Chen, B.; Kalenichenko, D.; Wang, W.; Weyand, T.; Andreetto, M.; Adam, H. Mobilenets: Efficient convolutional neural networks for mobile vision applications. arXiv 2017, arXiv:1704.04861. [Google Scholar]
  24. Sandler, M.; Howard, A.; Zhu, M.; Zhmoginov, A.; Chen, L.C. Mobilenetv2: Inverted residuals and linear bottlenecks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018. [Google Scholar]
  25. Howard, A.; Sandler, M.; Chu, G.; Chen, L.C.; Chen, B.; Tan, M.; Wang, W.; Zhu, Y.; Pang, R.; Vasudevan, V.; et al. Searching for mobilenetv3. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Korea, 27 October–2 November 2019. [Google Scholar]
  26. Jing, J.; Wang, Z.; Rätsch, M.; Zhang, H. Mobile-Unet: An efficient convolutional neural network for fabric defect detection. Text. Res. J. 2020, 92, 30–42. [Google Scholar] [CrossRef]
  27. Fu, J.; Liu, J.; Tian, H.; Li, Y.; Bao, Y.; Fang, Z.; Lu, H. Dual attention network for scene segmentation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, Long Beach, CA, USA, 16–17 June 2019. [Google Scholar]
  28. Liu, F.; Wang, L. UNet-based model for crack detection integrating visual explanations. Constr. Build. Mater. 2022, 322, 126265. [Google Scholar] [CrossRef]
  29. Wang, L.; Ma, X.-H.; Ye, Y. Computer vision-based Road Crack Detection Using an Improved I-UNet Convolutional Networks. In Proceedings of the 2020 Chinese Control And Decision Conference (CCDC), Hefei, China, 22–24 August 2020; pp. 539–543. [Google Scholar] [CrossRef]
  30. Yang, Y.; Zhao, Z.; Su, L.; Zhou, Y.; Li, H. Research on Pavement Crack Detection Algorithm based on Deep Residual Unet Neural Network. J. Phys. Conf. Ser. 2022, 2278, 012020. [Google Scholar] [CrossRef]
  31. Fan, X.; Cao, P.; Shi, P.; Wang, J.; Xin, Y.; Huang, W. A Nested Unet with Attention Mechanism for Road Crack Image Segmentation. In Proceedings of the 2021 IEEE 6th International Conference on Signal and Image Processing (ICSIP), Nanjing, China, 22–24 October 2021. [Google Scholar]
  32. Gao, X.; Jin, B. Research on Crack Detection Based on Improved UNet. In Proceedings of the 2020 7th International Conference on Information Science and Control Engineering (ICISCE), Changsha, China, 18–20 December 2020. [Google Scholar]
  33. Zhang, H.; Goodfellow, I.; Metaxas, D.; Odena, A. Self-attention generative adversarial networks. In Proceedings of the International Conference on Machine Learning, PMLR, Long Beach, CA, USA, 9–15 June 2019. [Google Scholar]
  34. Zhang, L.; Yang, F.; Zhang, Y.D.; Zhu, Y.J. Road crack detection using deep convolutional neural network. In Proceedings of the 2016 IEEE international conference on image processing (ICIP), Phoenix, AZ, USA, 25–28 September 2016. [Google Scholar]
  35. Zhang, H.; Tan, J.; Liu, L.; Wu, Q.M.J.; Wang, Y.; Jie, L. Automatic crack inspection for concrete bridge bottom surfaces based on machine vision. In Proceedings of the 2017 Chinese Automation Congress (CAC), Jinan, China, 20–22 October 2017; pp. 4938–4943. [Google Scholar] [CrossRef]
Figure 1. Proposed MA-Xnet architecture.
Figure 2. Two types of convolution images: (a) standard convolution; (b) depth-wise separable convolution.
Figure 3. Configuration of the bottleneck residual block.
Figure 4. The structure of PA modules.
Figure 5. The structure of CA modules.
Figure 6. Pixel accuracy curve comparison.
Figure 7. Loss function curve comparison.
Figure 8. Contrast experiment chart.
Figure 9. Detailed feature comparison diagram.
Figure 10. Two types of skip connection: (a) cross-skip connection; (b) standard skip connection.
Figure 11. Comparison of experimental results of two types of skip connection.
Table 1. The architecture of MobileNetV2.

Input Size      | Function          | t | n | s | c
(3, 224, 224)   | Conv2D            | – | 1 | 2 | 32
(32, 112, 112)  | Inverted residual | 1 | 1 | 1 | 16
(16, 112, 112)  | Inverted residual | 6 | 2 | 2 | 24
(24, 56, 56)    | Inverted residual | 6 | 3 | 2 | 32
(32, 28, 28)    | Inverted residual | 6 | 4 | 2 | 64
(64, 14, 14)    | Inverted residual | 6 | 3 | 1 | 96
(96, 14, 14)    | Inverted residual | 6 | 3 | 2 | 160
(160, 7, 7)     | Inverted residual | 6 | 1 | 1 | 320
(320, 7, 7)     | Conv2D            | 6 | 1 | 1 | 1280
(1280, 7, 7)    | –                 | – | – | – | –
Table 2. Comparison experiment result table.

Model            | Precision/% | Recall/% | F1/%  | mIoU/% | Parameters | Time/(s)
U-Net            | 95.08       | 68.62    | 79.21 | 68.95  | 36,968,449 | 0.012
MobileNetV2-UNet | 95.54       | 76.16    | 85.22 | 73.26  | 6,504,227  | 0.007
DANet            | 91.48       | 79.47    | 86.19 | 75.27  | 71,727,307 | 0.024
MA-Xnet          | 92.34       | 82.57    | 90.53 | 81.32  | 6,533,527  | 0.015
Table 3. Ablation experimental results.

Model    | Precision/% | Recall/% | F1-Score/% | mIoU/%
U-Net    | 95.08       | 68.62    | 79.21      | 68.95
M-UNet   | 91.15       | 77.21    | 83.59      | 73.92
PAM-UNet | 94.80       | 80.03    | 87.91      | 77.65
CAM-UNet | 92.19       | 79.61    | 85.48      | 76.20
MA-Xnet  | 92.34       | 82.57    | 90.53      | 81.32
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
