Article

Enhancing Road Crack Localization for Sustainable Road Safety Using HCTNet

1 Department of Computer Engineering & Applications, G.L.A. University, Mathura 281406, Uttar Pradesh, India
2 Centre of Research Impact and Outcome, Chitkara University, Rajpura 140401, Punjab, India
3 School of Computer Science and Engineering, Yeungnam University, Gyeongsan 38541, Republic of Korea
4 School of Computer Science and Artificial Intelligence, Wenzhou University, Wenzhou 325035, China
5 School of Electrical Engineering, Anhui Polytechnic University, Wuhu 241000, China
* Authors to whom correspondence should be addressed.
Sustainability 2024, 16(11), 4409; https://doi.org/10.3390/su16114409
Submission received: 12 April 2024 / Revised: 13 May 2024 / Accepted: 20 May 2024 / Published: 23 May 2024
(This article belongs to the Special Issue Road Safety and Road Infrastructure Design)

Abstract

Road crack detection is crucial for maintaining and inspecting civil infrastructure, as cracks pose a potential risk to sustainable road safety. Traditional methods for pavement crack detection are labour-intensive and time-consuming. In recent years, computer vision approaches have shown encouraging results in automating crack localization. However, classical convolutional neural network (CNN)-based approaches lack global attention to spatial features. To improve crack localization in roads, we designed an encoder and decoder based on a vision transformer (ViT) and convolutional neural networks (CNNs). In addition, a gated-attention module in the decoder is designed to focus on the upsampling process. Furthermore, we propose a hybrid loss function combining binary cross-entropy and Dice loss to train the model effectively. Our method achieved a recall, F1-score, and IoU of 98.54%, 98.07%, and 98.72% on the Crack500 dataset and 98.27%, 98.69%, and 98.76% on the DeepCrack dataset, respectively. On the proposed dataset, these figures were 96.89%, 97.20%, and 97.36%.

1. Introduction

Road cracks can be caused by several factors, including normal ageing, settling, seismic activity, and environmental influences [1]. Identifying and repairing road cracks is of the utmost concern to prevent further damage and ensure sustainable road safety. Furthermore, preserving a structure's integrity and extending its lifespan are essential for reducing economic losses [2]. Traditional techniques for crack detection typically involve manual examination, which is costly in terms of time and labour and is prone to human error.
As a result, there is an increasing need for automated algorithms that can accurately and efficiently segment crack regions in pavement images [3,4]. Conventional crack segmentation approaches employ edge detection, thresholding, region growing, and morphological operations [5,6,7]. These methods typically depend on handcrafted features to detect cracks in images. Edge detection techniques, such as Sobel, Canny, and Prewitt, can be employed to identify edges within the crack image, and the crack regions are then segmented by applying a threshold to these edges, as described in Reference [8]. Nevertheless, these methods are easily affected by factors such as image resolution, illumination conditions, and signal interference, and their effectiveness in challenging real-world situations with complex crack patterns and varying lighting remains limited [9]. Crack segmentation has also been extensively investigated with advanced image processing techniques, including texture analysis, machine learning, and computer vision algorithms. To identify the textural features of cracks and distinguish them from other areas of the image, texture descriptors such as local binary patterns (LBPs), the grey-level co-occurrence matrix (GLCM), and the wavelet transform are employed [10]. Several machine learning techniques, including support vector machines (SVMs), random forests, and k-nearest neighbours (k-NN), have been applied to classify regions as crack or non-crack; in these methods, features are extracted manually to train the classifier [11,12,13]. Crack regions have also been segmented using active contours, level sets, and graph cuts, which divide cracks into segments using region-based or boundary-based techniques. Although these state-of-the-art image-processing methods can achieve a high level of precision, as stated in Reference [14], there is still ample room for improvement in reproducing complex crack patterns and handling variations in lighting conditions.
Crack segmentation in roads has gained increasing attention in recent years through convolutional neural networks (CNNs) [15]. CNNs can learn hierarchical representations from large datasets to capture complex image patterns and features [16]. Several CNN variants, such as U-Net, fully convolutional networks (FCNs), and DeepLabv3+, have been proposed for crack segmentation and have shown promising results [17,18,19,20,21,22]. In most cases, classifying small image patches cannot reach pixel-level accuracy, which is required when searching for cracks. Fully convolutional encoder–decoder networks have therefore been developed for pixel-level segmentation in medical image processing, and crack segmentation is conceptually comparable to retinal vessel segmentation. For instance, UNet [23,24] and FusionNet [25], which are typically utilized for the segmentation of medical images, can be adapted for pixel-level crack segmentation. For crack segmentation, several deep CNN methods using SegNet [26], UNet [27], and their respective variants [28] have been implemented.
Both UNet and SegNet use an encoder–decoder structure. The encoder module uses a set of convolution and pooling layers to obtain high-level semantic features, and the decoder recovers lost spatial information from the encoder's high-resolution features by exploiting memorized pooling indices or skip connections. These techniques are generally built from stacked 3 × 3 convolution and pooling layers. Unfortunately, they could not attain pixel-level precision, which leads to coarse image segmentation. They also often fail to segment long cracks, producing discontinuous results because of the reduced receptive field associated with small convolutional kernels. Hou et al. (2022) performed pavement crack segmentation using a deep learning (DL) method: they expanded the dataset using image augmentation, converted the RGB images to binary images, and then utilized ResNet-50 to segment road cracks [29].
Qu et al. (2022) proposed a similar method for road crack segmentation using CA-SE-ResNet-50. They tested the model on four datasets and achieved a precision of 0.86 on the DeepCrack dataset [30]. Chen and Lin (2021) applied a feature fusion-based DL model for road crack segmentation; their hybrid atrous convolutional network (HACNet) uses four convolution layers and performs feature aggregation to retain valuable features, and results on six datasets confirm its robustness [31]. Al-Huda et al. (2023) recently proposed a hybrid model with knowledge transfer among class activation maps (KTCAMs) to visualize the crack region; their method also refines the boundary region for better segmentation and achieves state-of-the-art performance on the Crack500, CFD, CrackSC and DeepCrack datasets [32]. Pan et al. (2023) applied a generative adversarial network (GAN)-based CNN model for segmenting road cracks; on the Crack500 and CrackForest datasets, CrackSegAN obtained average F1-scores of 0.8412 and 0.9780, respectively [33]. Zhang et al. (2023) performed multi-region road crack segmentation using a DL method in which crack instances are mapped into the frame area, the overall area and the core area to produce the segmentation map; they tested the model on the CrackForest dataset and achieved 83% accuracy [34].
The significant contributions of the proposed method are as follows.
(a)
We proposed HCTNet (Hybrid Convolution Transformer Network), which uses a ViT-based encoder and convolution layers as the decoder to enhance long-range spatial-feature dependencies.
(b)
We designed a gated attention block in the decoder module to provide local attention to the spatial features, which improved the segmentation performance.
(c)
We utilized a fusion-based loss function designed using binary cross-entropy and Dice loss.
(d)
The superiority of the model is validated on three datasets. Out of these, two datasets are open-source datasets, and our team created the third dataset. The details of the third dataset are discussed in Section 3.
The rest of the manuscript is organized as follows.
Section 2 provides a detailed overview of the proposed method. In Section 3, we elaborate on the quantitative and visual results. Further, in Section 4, we discuss and compare the performance of the models. Finally, Section 5 concludes the proposed method.

2. Materials and Methods

In this section, we discuss the proposed HCTNet and loss function for segmenting the road cracks.

2.1. The HCTNet

In the proposed study, we utilized one convolution layer with 64 kernels to capture spatial information. The resulting features are passed through a convolutional embedding and then to the vision transformer, which serves as the encoder for the segmentation task. Including the ViT ensures spatial information is retained throughout the model, leveraging the transformer's capability of modelling long-range dependencies. Furthermore, as depicted in Figure 1, we employ a series of convolutional layers, succeeded by normalization layers, to transition the updated tensors from the embedded space to the input space at each resolution. At the bottleneck of the encoder, we introduced a deconvolution layer to enhance the resolution of the transformed feature map by a factor of 2. In addition, the gated-attention module follows the deconvolution layer within the decoder. Furthermore, a Softmax layer at the top of the model is added to localize the cracks. Let an image $I \in \mathbb{R}^{H \times W \times C}$ with height H, width W and C channels be passed to a 2D CNN layer with 64 kernels to extract spatial features. The extracted features are flattened into a D-dimensional sequence, and a 1D positional embedding is added to the patch embedding, as shown in Equation (1).
$z_0 = \left[ y_u^1 E;\ y_u^2 E;\ \ldots;\ y_u^N E \right] + E_{pos}$
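To make this embedding step concrete, the following is a minimal TensorFlow/Keras sketch of the convolutional embedding and 1D positional encoding described above; the image size, patch size, embedding dimension, and layer arrangement are illustrative assumptions rather than the exact HCTNet configuration.

```python
from tensorflow.keras import layers

# Illustrative hyperparameters (assumptions, not the paper's exact configuration).
IMG_SIZE, PATCH, EMBED_DIM = 320, 16, 768


class AddPositionEmbedding(layers.Layer):
    """Adds the learnable 1D positional embedding E_pos from Equation (1)."""
    def __init__(self, num_patches, dim, **kwargs):
        super().__init__(**kwargs)
        self.pos = self.add_weight(name="pos_embed",
                                   shape=(1, num_patches, dim),
                                   initializer="random_normal")

    def call(self, tokens):
        return tokens + self.pos


def conv_patch_embedding(x):
    """Initial convolution, patch projection, flattening and position embedding (Eq. 1)."""
    x = layers.Conv2D(64, 3, padding="same", activation="relu")(x)   # local spatial features
    x = layers.Conv2D(EMBED_DIM, PATCH, strides=PATCH)(x)            # linear patch projection E
    n = (IMG_SIZE // PATCH) ** 2                                     # number of tokens N
    tokens = layers.Reshape((n, EMBED_DIM))(x)                       # [y_u^1 E; ...; y_u^N E]
    return AddPositionEmbedding(n, EMBED_DIM)(tokens)                # z_0


inputs = layers.Input((IMG_SIZE, IMG_SIZE, 3))
z0 = conv_patch_embedding(inputs)
```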
Furthermore, the stacking of transformer block [35,36,37] with MSA (multi-head self-attention) and MLP (multi-layer perceptron) is carried out according to the following rules.
$z'_j = \mathrm{MSA}\left(\mathrm{Norm}(z_{j-1})\right) + z_{j-1}, \qquad j = 1, 2, \ldots, L$
$z_j = \mathrm{MLP}\left(\mathrm{Norm}(z'_j)\right) + z'_j, \qquad j = 1, 2, \ldots, L$
where L denotes the total number of transformer layers, Norm(·) denotes layer normalization [38], and the MLP uses the GELU activation function.
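Continuing the sketch above, a minimal Keras version of one such transformer encoder block is given below. The choice of extracting z3, z6, z9 and z12 for the decoder skips follows the description later in this section, while the number of heads and MLP width are assumed ViT-Base-style values; the sketch also assumes a TensorFlow version recent enough to provide MultiHeadAttention and GELU.

```python
from tensorflow.keras import layers

def transformer_block(z, num_heads=12, embed_dim=768, mlp_dim=3072):
    """One encoder layer implementing Equations (2) and (3):
    z'_j = MSA(Norm(z_{j-1})) + z_{j-1};  z_j = MLP(Norm(z'_j)) + z'_j."""
    # Pre-norm multi-head self-attention with a residual connection (Eq. 2).
    h = layers.LayerNormalization(epsilon=1e-6)(z)
    h = layers.MultiHeadAttention(num_heads=num_heads,
                                  key_dim=embed_dim // num_heads)(h, h)
    z_prime = layers.Add()([h, z])
    # Pre-norm MLP with GELU activation and a residual connection (Eq. 3).
    h = layers.LayerNormalization(epsilon=1e-6)(z_prime)
    h = layers.Dense(mlp_dim, activation="gelu")(h)
    h = layers.Dense(embed_dim)(h)
    return layers.Add()([h, z_prime])

# Stack L = 12 layers; intermediate outputs z_3, z_6, z_9, z_12 feed the decoder skips.
z, skips = z0, []
for j in range(1, 13):
    z = transformer_block(z)
    if j in (3, 6, 9, 12):
        skips.append(z)
```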
We utilized n parallel SA (self-attention) heads to construct the MSA sublayer. The SA block is a parameterized function that matches a query (q) in a sequence with the corresponding key (k) and value (v) components of a sequence $z \in \mathbb{R}^{N \times D}$. The attention weights (A) measure the correlation between two elements of z and their corresponding key–value pairs and are calculated as follows.
$A = \mathrm{Softmax}\left( \dfrac{q k^{T}}{\sqrt{D_h}} \right)$
where $D_h = D / n$ is a scaling factor used to keep the number of parameters consistent with k. The output of an attention head using A and the sequence z is calculated as follows.
$H_{out}(z) = A_h v_h$
Furthermore, to compute the MSA, the outputs from each head are concatenated and linearly projected.
$\mathrm{MSA}(z) = \left[ H_{out}^{1}(z);\ H_{out}^{2}(z);\ \ldots;\ H_{out}^{n}(z) \right] W_{msa}$
where $W_{msa} \in \mathbb{R}^{n \times D_h \times D}$ represents the MHT (multi-head trainable) weights. The decoder module is inspired by the architecture of the U-Net [39], in which features from several encoder resolutions are combined with the decoder. We use the transformer to extract a sequence representation $z_j$ (j = 3, 6, 9, 12) of size $\frac{H \times W}{P^2} \times D$ and reshape it into a tensor, so that the representation with feature size D lies in the embedding space after reshaping. Furthermore, as illustrated in Figure 1, we applied $3 \times 3$ convolutional layers, followed by BN (batch normalization), to project the modified tensors at each resolution from the embedded space into the input space.
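For readers who prefer code to notation, the following layer spells out Equations (4)–(6) explicitly: per-head scaled dot-product attention, concatenation of the heads, and the linear projection $W_{msa}$. It is a generic multi-head self-attention sketch with assumed dimensions, equivalent in spirit to the Keras MultiHeadAttention layer used in the encoder sketch above.

```python
import tensorflow as tf
from tensorflow.keras import layers

class MultiHeadSelfAttention(layers.Layer):
    """Explicit form of Equations (4)-(6): scaled dot-product attention per head,
    followed by concatenation of the heads and a linear projection W_msa."""

    def __init__(self, dim=768, num_heads=12, **kwargs):
        super().__init__(**kwargs)
        self.num_heads = num_heads
        self.head_dim = dim // num_heads                    # D_h = D / n
        self.qkv = layers.Dense(dim * 3, use_bias=False)    # produces q, k and v jointly
        self.proj = layers.Dense(dim)                       # W_msa

    def call(self, z):
        b, n = tf.shape(z)[0], tf.shape(z)[1]
        qkv = tf.reshape(self.qkv(z), (b, n, 3, self.num_heads, self.head_dim))
        qkv = tf.transpose(qkv, (2, 0, 3, 1, 4))            # (3, batch, heads, N, D_h)
        q, k, v = qkv[0], qkv[1], qkv[2]
        # Attention weights A = Softmax(q k^T / sqrt(D_h)) -- Equation (4).
        a = tf.nn.softmax(
            tf.matmul(q, k, transpose_b=True) / tf.sqrt(float(self.head_dim)), axis=-1)
        # Per-head output H_out = A_h v_h -- Equation (5).
        out = tf.matmul(a, v)                               # (batch, heads, N, D_h)
        # Concatenate heads and project with W_msa -- Equation (6).
        out = tf.reshape(tf.transpose(out, (0, 2, 1, 3)),
                         (b, n, self.num_heads * self.head_dim))
        return self.proj(out)
```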

2.2. The Decoder with Gated Attention

In the proposed study, we used pure CNN layers in the decoder to upsample the features. A gated-attention block is implemented in the decoder, which improves the focus on spatial features [36]. Furthermore, the extracted spatial-feature map is passed to the two-level residual attention block (see Figure 2). Let the encoded feature map be $F \in \mathbb{R}^{H_0 \times D_0 \times B_0}$. From the feature map F, we calculate a 1D channel-attention map $w_b \in \mathbb{R}^{B \times 1 \times 1}$, which dynamically adjusts the spatial information used for object recognition. After that, a two-dimensional spatial-normalization map is calculated to provide attention. The overall gated attention is formulated as follows.
$F_{cn} = w_b \otimes f, \qquad F_{sn} = w_s \otimes F_{cn}$
where f denotes the concatenation of the CNN and transformer features and $\otimes$ denotes element-wise multiplication. To better utilize the inter-channel correlation of the feature maps, we dynamically modify each channel's weight to improve channel normalization. The first step is to compute global average pooling over each channel. Then, a fully connected layer is employed to re-weight each channel appropriately.
$F_{cn} = \sigma\left( \mathrm{MLP}\left( \mathrm{AvgPool}(f) \right) \right) = \sigma\left( w_2\left( w_1\left( f_{avg}^{b} \right) \right) \right)$
where $w_1$ and $w_2$ are the MLP weights and $\sigma$ represents the sigmoid activation function.
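A minimal Keras sketch of such a gated-attention block is given below: channel weights obtained from global average pooling and a two-layer MLP with a sigmoid, followed by a single-channel spatial sigmoid gate. The reduction ratio and the 7 × 7 spatial kernel are assumptions for illustration, not values reported for HCTNet.

```python
from tensorflow.keras import layers

def gated_attention_block(f, reduction=8):
    """Channel re-weighting (F_cn) followed by spatial gating (F_sn) on the fused
    CNN/transformer feature map f; reduction ratio and 7x7 kernel are assumed."""
    channels = f.shape[-1]
    # Channel attention: global average pooling -> two-layer MLP -> sigmoid weights w_b.
    w = layers.GlobalAveragePooling2D()(f)                          # AvgPool(f)
    w = layers.Dense(channels // reduction, activation="relu")(w)   # w_1
    w = layers.Dense(channels, activation="sigmoid")(w)             # w_2 followed by sigma
    w = layers.Reshape((1, 1, channels))(w)
    f_cn = layers.Multiply()([f, w])                                # F_cn = w_b (x) f
    # Spatial gating: single-channel sigmoid map w_s applied to the refined features.
    s = layers.Conv2D(1, kernel_size=7, padding="same", activation="sigmoid")(f_cn)
    f_sn = layers.Multiply()([f_cn, s])                             # F_sn = w_s (x) F_cn
    return f_sn
```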

2.3. The Hybrid Loss Function

In this study, we propose a hybrid loss function that draws inspiration from cross-entropy and Dice loss. Cross-entropy loss is efficient for training classifiers; however, it does not directly optimize the overlap between the predicted and actual segments. Dice loss and IoU loss, in contrast, directly optimize for overlap, although they can be less stable during training than cross-entropy loss. The binary segmentation loss using binary cross-entropy is described as follows.
$L_{BCE} = -\sum_{j=1}^{N} \left[ L_j \log(p_j) + (1 - L_j) \log(1 - p_j) \right]$
where $L_j$ is the true label of pixel j, $p_j$ is the predicted value of pixel j, and N is the total number of pixels in the image. The Dice loss for binary segmentation is calculated as follows.
$L_{Dice} = 1 - \dfrac{2 \sum_{j=1}^{N} p_j L_j + \gamma}{\sum_{j=1}^{N} p_j + \sum_{j=1}^{N} L_j + \gamma}$
where $L_j$ is the true label of pixel j, $p_j$ is the predicted value of pixel j, N is the total number of pixels in the image, and $\gamma$ is a small smoothing value (1 × 10⁻⁵). Finally, the combined loss of the proposed method is defined as follows.
$L_{Combined} = \alpha L_{BCE} + \beta L_{Dice}$
where $\alpha$ and $\beta$ are weights that control the contribution of each loss term. The algorithm of the proposed method is given as follows (Algorithm 1).
Algorithm 1: Road crack localization using HCTNet.
Input: $I \in \mathbb{R}^{H \times W \times 3}$
(1) Set initial learning rate = 0.001, Batch size = 16
(2) for epoch = 1 to 200 do
 (a) Extract spatial features using convolution block
 (b) Flatten the features to D dimension
 (c) Generate query (q) in a sequence with the key (k) and value (v)
 (d) Perform SA using Equation (4) and MSA using Equation (6)
 (e) Perform upsampling of spatial features using decoder and gated attention
(3) Plot training loss using Equation (10)
(4) Plot the PRC (precision–recall curve) using precision and recall
Output: $I \in \mathbb{R}^{H \times W \times 1}$
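As a reference for step (3) of Algorithm 1, a minimal TensorFlow sketch of the hybrid loss of Equation (10) is given below; the equal weighting α = β = 0.5 is an assumed choice, since the weighting used in the experiments is not specified here.

```python
import tensorflow as tf

def hybrid_loss(y_true, y_pred, alpha=0.5, beta=0.5, gamma=1e-5):
    """Weighted combination of binary cross-entropy and Dice loss used as the
    training objective; alpha = beta = 0.5 is an assumed weighting."""
    y_true = tf.cast(y_true, tf.float32)
    y_pred = tf.clip_by_value(y_pred, 1e-7, 1.0 - 1e-7)      # numerical stability
    # Binary cross-entropy term, averaged over all pixels.
    bce = -tf.reduce_mean(y_true * tf.math.log(y_pred) +
                          (1.0 - y_true) * tf.math.log(1.0 - y_pred))
    # Dice term with smoothing constant gamma = 1e-5.
    intersection = tf.reduce_sum(y_true * y_pred)
    dice = 1.0 - (2.0 * intersection + gamma) / (
        tf.reduce_sum(y_true) + tf.reduce_sum(y_pred) + gamma)
    return alpha * bce + beta * dice

# Usage (assuming a Keras model producing per-pixel crack probabilities):
# model.compile(optimizer=tf.keras.optimizers.Adam(1e-3), loss=hybrid_loss)
```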

2.4. Mathematical Approach for Performance Measures

The performance of the binary image segmentation task is measured using precision (P), recall (R), F1-score and intersection over union (IoU). These measures are expressed in terms of true positives (TP), pixels correctly identified as foreground, and true negatives (TN), pixels correctly identified as background. False positives (FP) are background pixels incorrectly identified as foreground, whereas false negatives (FN) are foreground pixels incorrectly identified as background.
Precision (P): Precision can be defined as the ratio of accurately predicted positive observations to the total number of positive predictions.
$P = \dfrac{TP}{TP + FP}$
Recall (R): The ratio of correctly predicted positive observations to all observations in the actual positive class.
$R = \dfrac{TP}{TP + FN}$
F1-Score: The F1-Score measures the harmonic mean between precision and recall.
$F1\text{-}score = \dfrac{2 \times (P \times R)}{P + R}$
Intersection over Union (IoU): The IoU can be defined as the ratio of the predicted segmentation mask’s intersection with the ground-truth mask to the union of both masks.
$IoU = \dfrac{TP}{TP + FN + FP}$
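A simple NumPy sketch of these pixel-wise measures, computed directly from a binary prediction mask and its ground-truth mask, is shown below.

```python
import numpy as np

def segmentation_metrics(pred_mask, true_mask):
    """Pixel-wise precision, recall, F1-score and IoU from binary 0/1 masks
    of equal shape, following the definitions above."""
    pred = pred_mask.astype(bool)
    true = true_mask.astype(bool)
    tp = np.logical_and(pred, true).sum()        # crack pixels correctly detected
    fp = np.logical_and(pred, ~true).sum()       # background predicted as crack
    fn = np.logical_and(~pred, true).sum()       # crack pixels that were missed
    eps = 1e-12                                  # guards against division by zero
    precision = tp / (tp + fp + eps)
    recall = tp / (tp + fn + eps)
    f1 = 2 * precision * recall / (precision + recall + eps)
    iou = tp / (tp + fn + fp + eps)
    return {"precision": precision, "recall": recall, "f1": f1, "iou": iou}
```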

2.5. Comparison Methods

In the proposed study, we evaluated our method's performance against Unet [39], HED [40], SegNet [41], SRN (Side-output Residual Network) [42], FPHBN (Feature Pyramid and Hierarchical Boosting Network) [43], and SwinUnet [44]. The 23 layers of the Unet are composed of 3 × 3 convolution layers, each followed by a ReLU (Rectified Linear Unit), with 2 × 2 max pooling of stride 2. The HED network uses a pre-trained VGG16 backbone, on top of which deeply supervised nets and an FCN (Fully Convolutional Network) are applied. SegNet is an encoder–decoder model with 13 convolution layers, each followed by BN (batch normalization) and max pooling with a stride of 2. The SRN is a residual network with deep-to-shallow side outputs. The FPHBN extracts contextual information through a top-down architecture, and difficult samples are re-weighted using a hierarchical boosting module that contains sample layers. The SwinUnet is a transformer- and convolution-based model with encoder and decoder modules; the encoder block uses shifted-window attention in the transformer block, while the decoder block uses only convolutional blocks.

3. Results

This section includes quantitative and visual results of HCTNet and six other state-of-the-art methods on the three datasets.

3.1. Dataset Description

The proposed model is evaluated on three datasets for sustainable road safety. The Crack500 dataset contains 500 RGB images with a resolution of 1440 × 2560 pixels, collected from the main road of Temple University. The second dataset is the DeepCrack dataset, consisting of 537 images with a resolution of 544 × 384 pixels and binary annotations. The third dataset consists of 1532 images collected from Shri Radha City, Mathura, India; the images were captured using a Realme 10 Pro mobile phone (New Delhi, India). A detailed description of the capturing device is given in Table 1.
The image masks are generated using the LabelMe annotation tool. In this tool, the image is loaded and the crack area is selected using a polygon; a .json file of the segmented region is then created, and the masks are finally saved in .png format. Sample images and their binary masks are shown in Figure 3.
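A minimal sketch of this conversion for a single annotation is shown below. It assumes the standard LabelMe .json layout (an "imageWidth"/"imageHeight" pair and a list of polygon "shapes") and that every polygon in the file marks a crack; the file names are illustrative.

```python
import json
from PIL import Image, ImageDraw

def labelme_json_to_mask(json_path, out_path):
    """Converts one LabelMe .json annotation (polygon shapes) into a binary
    crack mask saved as .png; assumes every polygon in the file marks a crack."""
    with open(json_path) as fp:
        ann = json.load(fp)
    mask = Image.new("L", (ann["imageWidth"], ann["imageHeight"]), 0)
    draw = ImageDraw.Draw(mask)
    for shape in ann["shapes"]:
        polygon = [tuple(point) for point in shape["points"]]
        draw.polygon(polygon, outline=255, fill=255)        # crack region -> white
    mask.save(out_path)

# Example (hypothetical file names):
# labelme_json_to_mask("crack_0001.json", "crack_0001_mask.png")
```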

3.2. Experiment Setup

The proposed method was run on an NVIDIA Quadro RTX 4000 GPU. The workstation contains 128 GB of RAM, Windows 10, and dual 8 GB graphics cards. Python 3.8 and TensorFlow (v2.0) were used to write the scripts, and the Adam optimizer with an initial learning rate of 0.001 was used to accelerate the training process. In addition, for each experiment, images of size 300 × 300 pixels were fed to the model for training.

3.3. Performance Evaluation on Crack500 Dataset

The Crack500 dataset contains 500 RGB images. To avoid overfitting the model, we utilized data augmentation techniques such as vertical flip, horizontal flip, and rotation (30°); a brief sketch of this augmentation step is given at the end of this subsection. After augmentation, we kept 1658 images in the dataset and randomly split them into 80% for training and 20% for validation. Additionally, each model was trained with a batch size of 16 for 200 epochs. The performance of the models is shown in Table 2, where the IoU value of HED is 73.27% and the second-lowest value is achieved by Unet. An improvement in the IoU value can be noticed for SegNet and SRN, and SwinUnet achieves the second-highest value of 96.34%. The proposed HCTNet obtained an IoU and F1-score of 98.72% and 98.07%, respectively.
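The following is a minimal offline augmentation sketch along these lines, applying the same flip and rotation operations to an image and its mask; the use of PIL and the output naming scheme are illustrative assumptions.

```python
from PIL import Image

def augment_pair(image_path, mask_path, out_dir, rotation_deg=30):
    """Offline augmentation applied identically to an image and its mask:
    horizontal flip, vertical flip, and a 30-degree rotation."""
    image, mask = Image.open(image_path), Image.open(mask_path)
    variants = {
        "hflip": (image.transpose(Image.FLIP_LEFT_RIGHT),
                  mask.transpose(Image.FLIP_LEFT_RIGHT)),
        "vflip": (image.transpose(Image.FLIP_TOP_BOTTOM),
                  mask.transpose(Image.FLIP_TOP_BOTTOM)),
        "rot30": (image.rotate(rotation_deg), mask.rotate(rotation_deg)),
    }
    stem = image_path.rsplit("/", 1)[-1].rsplit(".", 1)[0]
    for name, (img_aug, mask_aug) in variants.items():
        img_aug.save(f"{out_dir}/{stem}_{name}.jpg")
        mask_aug.save(f"{out_dir}/{stem}_{name}.png")

# Example (hypothetical paths):
# augment_pair("images/crack_0001.jpg", "masks/crack_0001.png", "augmented")
```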

3.4. Performance on DeepCrack Dataset

The DeepCrack dataset contains 537 images; to improve the model's generalization, we increased the dataset size using data augmentation techniques. Furthermore, each model was trained for 200 epochs with a batch size of 16. The performance measures on the DeepCrack dataset are shown in Table 3, where SwinUnet obtained the highest recall value of 98.52%, while HCTNet achieved the second-highest value of 98.27%. HED received the lowest recall value of 78.28%, and the Unet, SegNet and FPHBN recall values are 91.23%, 94.18% and 96.43%, respectively.

3.5. Performance Evaluation on Proposed Dataset

We have created a new dataset that contains 1532 images. For a fair performance comparison, in each experiment we randomly divided the dataset into two parts: 80% for training and 20% for validation. After splitting the dataset, we trained each model with a batch size of 16 for 200 epochs. The performance measures on the proposed dataset are shown in Table 4, where the lowest precision value is obtained by Unet. The proposed HCTNet obtained the highest precision value, of 97.52%, whereas SwinUnet obtained the second-highest. SwinUnet also obtained the second-highest IoU value, of 96.87%, compared with 97.36% for HCTNet.

3.6. Qualitative Results

Classification maps help with model interpretability, performance evaluation, and effective communication. In this study, we generated the classification map of each model on Crack500, DeepCrack, and the proposed dataset, shown in Figure 4, Figure 5, and Figure 6, respectively. The visual maps of Unet [39] and HED [40] for thin cracks contain more noise than those of SegNet, although these models show less noise in large-crack areas. In contrast, FPHBN [43] and SwinUnet [44] have less noise in both thin- and large-crack areas on the Crack500 and DeepCrack datasets. The noise in the models' visual maps on the proposed dataset is relatively high compared to the Crack500 and DeepCrack datasets.

4. Discussion

Road crack detection is crucial for maintaining sustainable road safety, cost-efficiency, and infrastructure longevity. Early detection of cracks prevents accidents, saves money on extensive repairs, and helps preserve the integrity of road structures. Well-maintained roads ensure smooth traffic flow, reducing congestion and vehicle maintenance costs. Proactive maintenance based on crack detection is also essential for asset management, legal compliance, public satisfaction, and fostering technological advancements in infrastructure maintenance practices. Moreover, it has positive environmental impacts by promoting lower fuel consumption and reduced emissions. Several deep learning and machine learning methods have been utilized to detect cracks; however, owing to their reliance on handcrafted features, machine learning-based solutions fall short of achieving high accuracy.
In contrast, deep learning-based methods improve performance by extracting high-dimensional spatial features. Nevertheless, they fail to provide long-range dependencies among spatial features, and their computational cost also increases. In the proposed study, we design HCTNet, in which spatial features are extracted using a 2D convolutional layer with 64 kernels and, after convolutional embedding, passed to the ViT block. The ViT provides long-range dependencies among the features. After that, the features are passed to a decoder block that contains gated attention to provide local attention to the spatial features. Finally, the prediction is made through the Softmax layer. The quantitative and visual results of the HCTNet are discussed in Section 3; we achieved much better performance compared to the state-of-the-art methods. Further, we have plotted the training-loss curves of the model on the three datasets, shown in Figure 7. We can see in Figure 7a that the training loss on the Crack500 dataset is initially high, gradually decreases, and reaches close to zero after 150 epochs. Similarly, the DeepCrack loss reaches close to zero after 125 epochs. However, several peaks and dips can be seen on the proposed dataset, although after 175 epochs the loss reaches close to 0.01.

4.1. The PRC (Precision-Recall Curve)-Based Comparison

PRC analysis provides a method for assessing the effectiveness of segmentation methods and choosing an appropriate decision threshold [45,46]. A CNN generates an output that quantifies the likelihood of a region being part of a crack, and the PRC analysis selects the most effective threshold for each CNN to differentiate between crack and background. The operating point was chosen as the point with the least Euclidean distance to maximum precision and recall, and higher AUC values signify better model performance when comparing different models. The PRC curves of the different methods on Crack500, DeepCrack, and the proposed dataset are shown in Figure 8. The PRC area of HED [40] on the Crack500 dataset is 0.753, whereas Unet achieves 0.883. The highest PRC value, 0.987, is achieved by the proposed HCTNet, and the second-highest is obtained by SwinUnet [44]. On the DeepCrack dataset, each model showed an improved PRC value: HED obtained a larger PRC value than Unet, SegNet and FPHBN achieved 0.961 and 0.975, respectively, and SwinUnet and the proposed HCTNet obtained 0.986 and 0.998, respectively. On the proposed dataset, Unet and HED showed relatively lower PRC values, whereas FPHBN [43] and SwinUnet obtained PRC values of 0.931 and 0.975, respectively, and HCTNet achieved a PRC value of 0.982.

4.2. The Bar Plot-Based Comparison

We have plotted the performance measures of the different methods on the three datasets in Figure 9. On the Crack500 dataset, the precision and recall values of HED [40] are relatively lower than those of Unet [39], while SegNet [41] and SRN [42] show close values. Further improvement can be seen for FPHBN [43] and SwinUnet [44], and the highest values are achieved by the proposed HCTNet. Furthermore, all the methods showed remarkable performance on the DeepCrack dataset: SwinUnet achieved the highest recall value of 98.52%, and HCTNet obtained the highest F1-score, precision and IoU values, of 98.69%, 99.12% and 98.76%, respectively. Figure 9c shows that Unet [39] achieved the lowest performance on the proposed dataset, whereas SwinUnet and HCTNet showed relatively high precision and IoU values.

4.3. Ablation Study

We rigorously evaluated the model performance using different components, as shown in Table 5. Using the ViT as the encoder and a pure CNN as the decoder yields a relatively lower IoU value on the datasets, while adding the CNN block to the ViT in the encoder improves the IoU value. Furthermore, the GA (gated-attention) block in the decoder enhances the channel and spatial attention, which leads to a further improvement in the IoU value on all three datasets.

4.4. Computation Time Evaluation

We computed the training time in minutes (m) and validation time in seconds (s) of Unet, HED, SegNet, SRN, FPHBN, Swin-UNet and the proposed HCTNet on the Crack500, DeepCrack, and proposed datasets under the same experimental conditions and with the same number of images, as shown in Table 6. SegNet has the least training and validation time on the Crack500 dataset, and the computation time of SRN is the highest. SwinUnet has a slightly higher training and validation time than HCTNet on Crack500. On the DeepCrack dataset, slightly increased training and validation times can be observed due to the complex nature of the cracks and the dataset size. Furthermore, our proposed dataset is much larger and more complex than Crack500 and DeepCrack; due to this, a further increase in the training and validation time can be observed. On the proposed dataset, the highest training and validation time is taken by SwinUnet, whereas SegNet takes the least time. The proposed HCTNet has a slightly higher training time than SegNet; however, its performance is much better than that of the traditional CNN-based and ViT-based methods.

4.5. Performance Evaluation Using Different Loss Function

The effect of different loss functions on the model performance is shown in Table 7. The IoU values of the proposed model on the Crack500, DeepCrack, and proposed datasets using binary cross-entropy are relatively low compared to the Dice loss. However, the proposed hybrid loss showed improvements of 1.07%, 1.52% and 1.19% on the Crack500, DeepCrack, and proposed datasets, respectively.

4.6. Training-Accuracy Curve

The training accuracy of HCTNet on Crack500, DeepCrack and the proposed dataset is shown in Figure 10. In Figure 10a, the initial training accuracy is close to 63%, and it increases in subsequent epochs, crossing 95% after 75 epochs and exceeding 99% after 175 epochs. On the DeepCrack dataset, the initial training accuracy is 65%, as shown in Figure 10b, and it reaches close to 100% after 175 epochs. Moreover, on the proposed dataset, fluctuations in the training accuracy can be seen up to around 82 epochs; after 100 epochs, the training accuracy increases and exceeds 99% at 175 epochs.

5. Conclusions

The potential safety hazards and structural impairments emphasize the necessity of road crack detection in civil infrastructure maintenance and inspection. The conventional practices for this task are noted for being laborious and slow, mainly due to the manual nature of the inspection and measurement involved. We utilized a DL methodology for automatic crack image segmentation to automate and expedite this process for sustainable road safety. By integrating a ViT with CNNs alongside data augmentation strategies, we improved the efficiency and accuracy of crack segmentation. The encouraging outcomes, validated through rigorous testing on three datasets, underline the viability of the HCTNet. We achieved recall, F1-score and IoU values of 98.54%, 98.07% and 98.72% on the Crack500 dataset and 98.27%, 98.69% and 98.76% on the DeepCrack dataset. On the proposed dataset, the recall, F1-score and IoU are 96.89%, 97.20%, and 97.36%, respectively, exhibiting high precision and accuracy in road crack segmentation. This work therefore makes a significant stride towards automating the crack detection process, setting a robust foundation for future explorations in road safety. Dataset quality and real-world generalization issues may hinder our model's performance. Further, the requirement for substantial computational resources and annotation costs poses challenges for real-time applications and data preparation. Aiming for real-time processing, exploring multiscale ViTs, and enhancing robustness to varied conditions are potential areas of future work. In addition, integrating multi-modal data, domain adaptation techniques and automated annotation methods, and optimizing the model for edge-device deployment could significantly broaden the model's practical utility and scalability.

Author Contributions

Conceptualization, D.P.Y., B.S. and S.C.; data curation, D.P.Y. and B.S.; formal analysis, S.C.; investigation, F.A. and R.A.; methodology, D.P.Y., B.S. and S.C.; project administration, F.A. and R.A.; software, B.S.; visualization, F.A. and R.A.; writing—original draft, D.P.Y., B.S. and S.C.; writing—review and editing, F.A. and R.A. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The datasets used in the study can be accessed through the following URLs: Crack500 dataset: https://github.com/fyangneil/pavement-crack-detection (accessed on 11 April 2024); DeepCrack dataset: https://github.com/yhlleo/DeepCrack/tree/master/dataset (accessed on 11 April 2024).

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Kirthiga, R.; Elavenil, S. A survey on crack detection in concrete surface using image processing and machine learning. J. Build. Pathol. Rehabil. 2024, 9, 15. [Google Scholar] [CrossRef]
  2. Hamishebahar, Y.; Guan, H.; So, S.; Jo, J. A comprehensive review of deep learning-based crack detection approaches. Appl. Sci. 2022, 12, 1374. [Google Scholar] [CrossRef]
  3. Islam, M.M.; Hossain, M.B.; Akhtar, M.N.; Moni, M.A.; Hasan, K.F. CNN based on transfer learning models using data augmentation and transformation for detection of concrete crack. Algorithms 2022, 15, 287. [Google Scholar] [CrossRef]
  4. Kheradmandi, N.; Mehranfar, V. A critical review and comparative study on image segmentation-based techniques for pavement crack detection. Constr. Build. Mater. 2022, 321, 126162. [Google Scholar] [CrossRef]
  5. Hoang, N.D.; Nguyen, Q.L. A novel method for asphalt pavement crack classification based on image processing and machine learning. Eng. Comput. 2019, 35, 487–498. [Google Scholar] [CrossRef]
  6. Ibrahim, H.B.; Salah, M.; Zarzoura, F.; El-Mewafi, M. Smart monitoring of road pavement deformations from UAV images by using machine learning. Innov. Infrastruct. Solut. 2024, 9, 16. [Google Scholar] [CrossRef]
  7. Hoang, N.D.; Nguyen, Q.L. Automatic recognition of asphalt pavement cracks based on image processing and machine learning approaches: A comparative study on classifier performance. Math. Probl. Eng. 2018, 2018, 6290498. [Google Scholar] [CrossRef]
  8. Song, W.; Jia, G.; Jia, D.; Zhu, H. Automatic pavement crack detection and classification using multiscale feature attention network. IEEE Access 2019, 7, 171001–171012. [Google Scholar] [CrossRef]
  9. Sari, Y.; Prakoso, P.B.; Baskara, A.R. Road crack detection using support vector machine (SVM) and OTSU algorithm. In Proceedings of the 2019 6th International Conference on Electric Vehicular Technology (ICEVT), Bali, Indonesia, 18–21 November 2019; pp. 349–354. [Google Scholar]
  10. Meng, F.; Li, A. Pavement crack detection using sketch token. Procedia Comput. Sci. 2018, 139, 151–157. [Google Scholar] [CrossRef]
  11. Hoang, N.D.; Nguyen, Q.L.; Tien Bui, D. Image processing–based classification of asphalt pavement cracks using support vector machine optimized by artificial bee colony. J. Comput. Civ. Eng. 2018, 32, 04018037. [Google Scholar] [CrossRef]
  12. Inkoom, S.; Sobanjo, J.; Barbu, A.; Niu, X. Prediction of the crack condition of highway pavements using machine learning models. Struct. Infrastruct. Eng. 2019, 15, 940–953. [Google Scholar] [CrossRef]
  13. Inkoom, S.; Sobanjo, J.; Barbu, A.; Niu, X. Pavement crack rating using machine learning frameworks: Partitioning, bootstrap forest, boosted trees, Naïve bayes, and K-Nearest neighbors. J. Transp. Eng. Part B Pavements 2019, 145, 04019031. [Google Scholar] [CrossRef]
  14. Nguyen, S.D.; Tran, T.S.; Tran, V.P.; Lee, H.J.; Piran, M.J.; Le, V.P. Deep learning-based crack detection: A survey. Int. J. Pavement Res. Technol. 2023, 16, 943–967. [Google Scholar] [CrossRef]
  15. Branikas, E.; Murray, P.; West, G. A novel data augmentation method for improved visual crack detection using generative adversarial networks. IEEE Access 2023, 11, 22051–22059. [Google Scholar] [CrossRef]
  16. Ai, D.; Jiang, G.; Lam, S.K.; He, P.; Li, C. Computer vision framework for crack detection of civil infrastructure—A review. Eng. Appl. Artif. Intell. 2023, 117, 105478. [Google Scholar] [CrossRef]
  17. Shang, J.; Xu, J.; Zhang, A.A.; Liu, Y.; Wang, K.C.; Ren, D.; Zhang, H.; Dong, Z.; He, A. Automatic Pixel-level pavement sealed crack detection using Multi-fusion U-Net network. Measurement 2023, 208, 112475. [Google Scholar] [CrossRef]
  18. Siriborvornratanakul, T. Pixel-level thin crack detection on road surface using convolutional neural network for severely imbalanced data. Comput. Aided Civ. Infrastruct. Eng. 2023, 38, 2300–2316. [Google Scholar] [CrossRef]
  19. Yadav, D.P.; Kishore, K.; Gaur, A.; Kumar, A.; Singh, K.U.; Singh, T.; Swarup, C. A Novel Multi-Scale Feature Fusion-Based 3SCNet for Building Crack Detection. Sustainability 2022, 14, 16179. [Google Scholar] [CrossRef]
  20. Zhang, A.; Wang, K.C.; Fei, Y.; Liu, Y.; Chen, C.; Yang, G.; Li, J.Q.; Yang, E.; Qiu, S. Automated pixel-level pavement crack detection on 3D asphalt surfaces using a deep-learning network. Comput. Aided Civ. Infrastruct. Eng. 2017, 32, 805–819. [Google Scholar] [CrossRef]
  21. Zou, Q.; Zhang, Z.; Li, Q.; Qi, X.; Wang, Q.; Wang, S. Deepcrack: Learning hierarchical convolutional features for crack detection. IEEE Trans. Image Process. 2018, 28, 1498–1512. [Google Scholar] [CrossRef] [PubMed]
  22. Shen, Y.; Yu, Z.; Li, C.; Zhao, C.; Sun, Z. Automated Detection for Concrete Surface Cracks Based on Deeplabv3+, B.D.F. Buildings 2023, 13, 118. [Google Scholar] [CrossRef]
  23. Ji, J.; Wu, L.; Chen, Z.; Yu, J.; Lin, P.; Cheng, S. Automated pixel-level surface crack detection using U-Net. In Proceedings of the Multi-Disciplinary Trends in Artificial Intelligence: 12th International Conference, MIWAI 2018, Hanoi, Vietnam, 18–20 November 2018; pp. 69–78. [Google Scholar]
  24. Zhang, Z.; Liu, Q.; Wang, Y. Road extraction by deep residual u-net. IEEE Geosci. Remote Sens. Lett. 2018, 15, 749–753. [Google Scholar] [CrossRef]
  25. Quan, T.M.; Hildebrand DG, C.; Jeong, W.K. Fusionnet: A deep fully residual convolutional neural network for image segmentation in connectomics. Front. Comput. Sci. 2021, 3, 613981. [Google Scholar] [CrossRef]
  26. Guo, C.; Gao, W.; Zhou, D. Research on road surface crack detection based on SegNet network. J. Eng. Appl. Sci. 2024, 71, 54. [Google Scholar] [CrossRef]
  27. Nguyen, D.K.; Tran, T.T.; Nguyen, C.P.; Pham, V.T. Skin lesion segmentation based on integrating efficientnet and residual block into U-Net neural network. In Proceedings of the 2020 5th International Conference on Green Technology and Sustainable Development (GTSD), Ho Chi Minh City, Vietnam, 27–28 November 2020; pp. 366–371. [Google Scholar]
  28. Liu, Y.; Yao, J.; Lu, X.; Xie, R.; Li, L. DeepCrack: A deep hierarchical feature learning architecture for crack segmentation. Neurocomputing 2019, 338, 139–153. [Google Scholar] [CrossRef]
  29. Hou, Y.; Liu, S.; Cao, D.; Peng, B.; Liu, Z.; Sun, W.; Chen, N. A deep learning method for pavement crack identification based on limited field images. IEEE Trans. Intell. Transp. Syst. 2022, 23, 22156–22165. [Google Scholar] [CrossRef]
  30. Qu, Z.; Wang, C.Y.; Wang, S.Y.; Ju, F.R. A method of hierarchical feature fusion and connected attention architecture for pavement crack detection. IEEE Trans. Intell. Transp. Syst. 2022, 23, 16038–16047. [Google Scholar] [CrossRef]
  31. Chen, H.; Lin, H. An effective hybrid atrous convolutional network for pixel-level crack detection. IEEE Trans. Instrum. Meas. 2021, 70, 5009312. [Google Scholar] [CrossRef]
  32. Al-Huda, Z.; Peng, B.; Algburi RN, A.; Al-antari, M.A.; Rabea, A.J.; Zhai, D. A hybrid deep learning pavement crack semantic segmentation. Eng. Appl. Artif. Intell. 2023, 122, 106142. [Google Scholar] [CrossRef]
  33. Pan, Z.; Lau, S.L.; Yang, X.; Guo, N.; Wang, X. Automatic pavement crack segmentation using a generative adversarial network (GAN)-based convolutional neural network. Results Eng. 2023, 19, 101267. [Google Scholar] [CrossRef]
  34. Zhang, J.; Li, Y.; Jiang, Z.; Xu, S. Multi-Region Segmentation Pavement Crack Detection Method Based on Deep Learning. Int. J. Pavement Res. Technol. 2023, 1–11. [Google Scholar] [CrossRef]
  35. Liu, H.; Yang, J.; Miao, X.; Mertz, C.; Kong, H. CrackFormer Network for Pavement Crack Segmentation. IEEE Trans. Intell. Transp. Syst. 2023, 24, 9240–9252. [Google Scholar] [CrossRef]
  36. Valanarasu, J.M.J.; Oza, P.; Hacihaliloglu, I.; Patel, V.M. Medical transformer: Gated axial-attention for medical image segmentation. In Proceedings of the Medical Image Computing and Computer Assisted Intervention—MICCAI 2021: 24th International Conference, Part I 24, Strasbourg, France, 27 September–1 October 2021; Springer International Publishing: Berlin/Heidelberg, Germany, 2021; pp. 36–46. [Google Scholar]
  37. Oktay, O.; Schlemper, J.; Folgoc, L.L.; Lee, M.; Heinrich, M.; Misawa, K.; Mori, K.; McDonagh, S.; Hammerla, N.Y.; Kainz, B.; et al. Attention u-net: Learning where to look for the pancreas. arXiv 2018, arXiv:1804.03999. [Google Scholar]
  38. Ba, J.L.; Kiros, J.R.; Hinton, G.E. Layer normalization. arXiv 2016, arXiv:1607.06450. [Google Scholar]
  39. Ronneberger, O.; Fischer, P.; Brox, T. U-net: Convolutional networks for biomedical image segmentation. In Proceedings of the Medical Image Computing and Computer-Assisted Intervention—MICCAI 2015: 18th International Conference, Munich, Germany, 5–9 October 2015; Proceedings, Part III 18. Springer International Publishing: Berlin/Heidelberg, Germany, 2015; pp. 234–241. [Google Scholar]
  40. Xie, S.; Tu, Z. Holistically-nested edge detection. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 1–13 December 2015; pp. 1395–1403. [Google Scholar]
  41. Badrinarayanan, V.; Kendall, A.; Cipolla, R. Segnet: A deep convolutional encoder-decoder architecture for image segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 2481–2495. [Google Scholar] [CrossRef] [PubMed]
  42. Ke, W.; Chen, J.; Jiao, J.; Zhao, G.; Ye, Q. SRN: Side-output residual network for object symmetry detection in the wild. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 1068–1076. [Google Scholar]
  43. Yang, F.; Zhang, L.; Yu, S.; Prokhorov, D.; Mei, X.; Ling, H. Feature pyramid and hierarchical boosting network for pavement crack detection. IEEE Trans. Intell. Transp. Syst. 2019, 21, 1525–1535. [Google Scholar] [CrossRef]
  44. Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, BC, Canada, 11–17 October 2021; pp. 10012–10022. [Google Scholar]
  45. Dhiman, P.; Kukreja, V.; Manoharan, P.; Kaur, A.; Kamruzzaman, M.M.; Dhaou, I.B.; Iwendi, C. A novel deep learning model for detection of severity level of the disease in citrus fruits. Electronics 2022, 11, 495. [Google Scholar] [CrossRef]
  46. Kukreja, V.; Dhiman, P. A Deep Neural Network based disease detection scheme for Citrus fruits. In Proceedings of the 2020 International Conference on Smart Electronics and Communication (ICOSEC), Trichy, India, 10–12 September 2020; pp. 97–101. [Google Scholar]
Figure 1. The proposed HCTNet architecture for crack localization.
Figure 2. Channel- and spatial-attention block for each decoding path.
Figure 3. Sample images of the proposed dataset (a) and (c) and their binary mask in (b) and (d), respectively.
Figure 4. The original image, ground truth, Unet, HED, SegNet, SRN, FPHBN, SwinUnet and HCTNet visual map on Crack500 are shown in (a), (b), (c), (d), (e), (f), (g), (h), (i) and (j), respectively, (zoom in for better view).
Figure 5. The original image, ground truth, Unet, HED, SegNet, SRN, FPHBN, SwinUnet and HCTNet visual map on the DeepCrack are shown in (a), (b), (c), (d), (e), (f), (g), (h), (i) and (j), respectively, (zoom in for better view).
Figure 6. The original image, ground truth, Unet, HED, SegNet, SRN, FPHBN, SwinUnet and HCTNet visual map on the proposed dataset are shown in (a), (b), (c), (d), (e), (f), (g), (h), (i) and (j), respectively, (zoom in for better view).
Figure 7. The loss of the proposed method on Crack500, DeepCrack, and the proposed dataset is shown in (a), (b) and (c), respectively.
Figure 8. The PRC-based evaluation of the models on Crack500, DeepCrack, and proposed dataset is shown in (a), (b) and (c), respectively.
Figure 9. The bar plot-based comparison on Crack500, DeepCrack, and proposed dataset is shown in (a), (b) and (c), respectively.
Figure 10. The training-accuracy plot on the (a) Crack500, (b) DeepCrack and (c) proposed dataset.
Table 1. The description of the image-capturing device.
Image Acquisition Device: Android phone
Android Model Name: Realme 10 Pro Plus 5G
Image Resolution: 3060 × 4080 pixels
Focal Length: 23.6 mm
Aperture: f/1.75
Rear Camera: 108 MP
Exposure Period: 1/50 s
Flash: None
Table 2. Performance measures on Crack500 (bold indicates highest value).
Method | IoU (%) | Precision (%) | Recall (%) | F1-Score (%)
Unet [39] | 85.19 | 86.24 | 81.63 | 83.96
HED [40] | 73.27 | 72.18 | 75.48 | 73.79
SegNet [41] | 88.76 | 86.12 | 92.96 | 89.40
SRN [42] | 92.97 | 88.54 | 93.61 | 92.10
FPHBN [43] | 94.12 | 94.82 | 92.13 | 93.45
SwinUnet [44] | 96.34 | 97.74 | 96.87 | 97.30
Proposed HCTNet | 98.72 | 97.62 | 98.54 | 98.07
Table 3. Performance measures on DeepCrack (bold indicates highest value).
Method | IoU (%) | Precision (%) | Recall (%) | F1-Score (%)
Unet [39] | 87.05 | 84.74 | 91.23 | 87.86
HED [40] | 87.31 | 88.39 | 78.28 | 83.02
SegNet [41] | 93.18 | 92.73 | 94.18 | 93.44
SRN [42] | 94.89 | 95.02 | 92.83 | 93.91
FPHBN [43] | 96.19 | 97.82 | 96.42 | 97.11
SwinUnet [44] | 97.85 | 96.98 | 98.52 | 97.74
Proposed HCTNet | 98.76 | 99.12 | 98.27 | 98.69
Table 4. Performance measures on the proposed dataset (bold indicates highest value).
Method | IoU (%) | Precision (%) | Recall (%) | F1-Score (%)
Unet [39] | 74.21 | 73.10 | 81.25 | 76.95
HED [40] | 81.54 | 82.78 | 75.19 | 78.80
SegNet [41] | 82.43 | 79.82 | 84.68 | 82.17
SRN [42] | 88.67 | 87.32 | 94.15 | 90.60
FPHBN [43] | 92.59 | 92.37 | 86.95 | 89.57
SwinUnet [44] | 96.87 | 96.17 | 95.32 | 95.74
Proposed HCTNet | 97.36 | 97.52 | 96.89 | 97.20
Table 5. Effects of different components on the model performance.
Dataset | Encoder | Decoder | IoU (%)
Crack500 | ViT | CNN | 95.17
Crack500 | Conv2D+ViT | CNN | 97.06
Crack500 | Conv2D+ViT | CNN+GA | 98.72
DeepCrack | ViT | CNN | 96.05
DeepCrack | Conv2D+ViT | CNN | 97.18
DeepCrack | Conv2D+ViT | CNN+GA | 98.76
Proposed | ViT | CNN | 94.23
Proposed | Conv2D+ViT | CNN | 95.73
Proposed | Conv2D+ViT | CNN+GA | 97.36
Table 6. Computation time analysis on the different datasets.
Model | Crack500 Train (m) | Crack500 Val (s) | DeepCrack Train (m) | DeepCrack Val (s) | Proposed Train (m) | Proposed Val (s)
Unet | 187 | 65 | 210 | 76 | 321 | 193
HED | 165 | 47 | 183 | 64 | 280 | 127
SegNet | 153 | 42 | 167 | 53 | 265 | 118
SRN | 218 | 85 | 236 | 97 | 376 | 215
FPHBN | 176 | 65 | 192 | 79 | 287 | 135
Swin-UNet | 201 | 81 | 207 | 86 | 357 | 206
HCTNet | 172 | 57 | 184 | 68 | 296 | 197
Table 7. Effect of different loss functions on model performance (IoU).
Dataset | Binary Cross-Entropy | Dice | Proposed Hybrid
Crack 500 | 96.86% | 97.65% | 98.72%
DeepCrack | 97.17% | 97.24% | 98.76%
Proposed dataset | 96.05% | 96.17% | 97.36%
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Yadav, D.P.; Sharma, B.; Chauhan, S.; Amin, F.; Abbasi, R. Enhancing Road Crack Localization for Sustainable Road Safety Using HCTNet. Sustainability 2024, 16, 4409. https://doi.org/10.3390/su16114409
