
Lightweight Crack Automatic Detection Algorithm Based on TF-MobileNet

1 School of Mechanical Engineering, Guizhou University, Guiyang 550025, China
2 State Key Laboratory of Public Big Data, Guizhou University, Guiyang 550025, China
* Author to whom correspondence should be addressed.
Appl. Sci. 2024, 14(19), 9004; https://doi.org/10.3390/app14199004
Submission received: 31 August 2024 / Revised: 30 September 2024 / Accepted: 3 October 2024 / Published: 6 October 2024

Abstract

As society develops, the aging of building facilities has become inevitable. Manual crack detection is limited in efficiency, so intelligent detection technology must be explored. This article proposes a novel crack detection method, TF-MobileNet. To balance lightweight design with effective crack feature extraction, we developed a novel crack feature extraction backbone network that combines a Transformer with MobileNetV3. We then improved the feature fusion network using the multi-head attention mechanism of the Bottleneck Transformer, which strengthens the feature fusion effect, and integrated the SENet and SimAM attention mechanisms into the feature extraction and feature fusion networks, further improving crack detection performance. Finally, we deployed our model on an edge device (NVIDIA Jetson Nano). The findings indicate that the proposed model achieves 90.8% mAP on our dataset and works well on the edge device side, which meets the requirements of automatic crack detection. Our model enables real-time monitoring of pavement using edge devices, allowing timely maintenance and repair. In the future, the model can be trained to recognize more pavement distress features, addressing road safety issues more effectively.

1. Introduction

Pavement cracking is one of the most common forms of transportation infrastructure failure [1]. After prolonged or excessive use, most pavement distresses first manifest as cracks. If not repaired promptly, these cracks can expand and worsen due to external environmental or human factors [2]. Driving at high speeds on such roads not only affects driving comfort but can also contribute to serious traffic accidents. The continued expansion of small cracks can lead to potential hazards. Repairing cracks before they deteriorate can significantly reduce maintenance costs, prevent accidents, and extend the lifespan of the road [3,4]. Crack detection has become a crucial component of concrete structure maintenance to evaluate the safety and durability of these structures [1].
Traditional crack detection relied on either manual visual inspection or various technological methods such as electromagnetic [5], infrared [6], ultrasonic [7], or radioactive detection [8]. Manual inspection not only suffers from low efficiency but also incurs high labor costs and faces geographical constraints, making it unsuitable for areas like viaduct pavements and bridge piers. Despite continuous advancements in detection techniques using external equipment, these methods are still susceptible to subjective influences.
These drawbacks have driven researchers and road maintenance companies to explore more efficient and comprehensive crack detection methods. With the continuous advancements in computer hardware, computer vision, and artificial intelligence, image-based crack detection has become a mainstream trend. Current image-based deep learning crack detection methods require a significant quantity of data to train the algorithmic models. It takes time to collect the crack data and upload it to the computing side. Additionally, cloud computing has high latency and cannot respond quickly to incidents generated by cracks. Edge computing, characterized by low latency and fast response, can detect cracks immediately and assist road operation and maintenance companies in completing subsequent repair work. Considering the limitations of traditional crack detection methods and leveraging the benefits of edge computing technologies and deep learning, this research aims to propose a lightweight crack detection model that can be easily deployed on edge devices to achieve a fast response to cracks. The following are the specific outcomes derived from this research:
We have developed a novel deep learning algorithm, TF-MobileNet, specifically designed for crack detection.
We combined the lightweight network MobileNetV3 with a Transformer to build the backbone network for tiny crack feature extraction and introduced the attention mechanism into the backbone network, which effectively allows the network to place greater emphasis on identifying the characteristics of cracks.
We combined BoTNet and PANet to build a crack feature fusion network and introduced a parameter-free attention mechanism: SimAM, which emphasizes the crack features without imposing excessive computational effort on the network. We performed ablation experiments to confirm the efficacy of each individual module.
We have deployed the model to edge devices and performed crack detection at the edge side under two different environmental conditions, verifying the feasibility of the lightweight model.
The organization of this paper is laid out as follows. In Section 2, we review the outstanding previous research in the field of crack detection. In Section 3, we provide an overview of the overall structure of the proposed lightweight model. In Section 4, we train the models and conduct comparative experiments, reporting various evaluation criteria for all models in the experiments. In Section 5, we deploy the proposed model on edge devices and perform both image and video tests. In Section 6, we summarize the overall work.

2. Related Works

Pavement condition assessment has transitioned from manual to automated methods, with the automatic identification of pavement cracks using computer vision being a key element of pavement crack detection systems. Most modern pavement crack analyzers rely on computer vision and digital image processing techniques to automate crack detection [9]. To support this task, image acquisition techniques have matured and become increasingly standardized over the last twenty years. Image capture modules record 2D or 3D pavement images, typically from a high-speed vehicle via an imaging device mounted at its rear that photographs cracks on the road sections the vehicle passes over [10]. Deep convolutional networks have unique advantages in image processing tasks owing to their large number of parameters, and deep learning-based object detection algorithms offer a viable alternative to conventional detection methods that rely on manually engineered crack features.

2.1. Crack Identification Using Deep Learning Techniques

Deep learning is a subfield of machine learning. It not only performs well in classification tasks; researchers have increasingly found that it also excels in object detection and semantic segmentation, yielding favorable outcomes. In the domain of crack identification, Han and Wang [11] presented a YOLOv3-based crack detection method for pavement. They manually labeled the collected crack images and then trained the model to achieve 88% detection accuracy on their constructed dataset. In another study [12], the authors utilized a pre-trained Faster R-CNN to detect crack images under different shooting conditions and investigated the effects of weather and light levels on crack detection. They found that crack detection was almost unaffected under clear, cloudy, and foggy conditions, but precision dropped markedly under sunset or moonlight. Yu et al. [13] developed a crack detection model, YOLOv4-FPM, based on an enhanced version of YOLOv4 designed for real-time application. They first optimized the loss function using focal loss to improve accuracy on cracks against complex backgrounds; they then pruned the network, shrinking its size by 81.8%, which significantly increased detection speed, albeit with some accuracy loss. Zhang et al. [14] refined YOLOv3 for UAV crack data by replacing the Leaky ReLU activation functions with Mish, except in the feature fusion network. They introduced a Multi-level Attention Block (MLAB) after the backbone, achieving a higher mean Average Precision (mAP) on the original database. Tan et al. [15] designed an automatic sewer pipe defect detection method using an enhanced YOLOv3. They applied an adaptive image scaling strategy and replaced the MSE bounding box loss function with GIoU, achieving an average precision of 92% on their dataset. Alshawabkeh et al. [16] proposed a novel pavement crack detection method that integrates deep learning for feature extraction, the Whale Optimization Algorithm (WOA) for feature selection, and Random Forest (RF) for classification, achieving high crack detection accuracy. Chen et al. [17] developed an improved Swin-Unet model (iSwin-Unet) that incorporates an enhanced skip attention module and residual Swin blocks; it performs global modeling of pavement crack features while efficiently exchanging crack feature information.
In the field of crack segmentation, Liu et al. [18] first applied U-Net to crack segmentation. They trained with the focal loss function, optimized with Adam, and then used the trained model to detect images with varying illumination, cluttered backgrounds, and different crack widths, demonstrating good results and robustness. Shi et al. [19] put forward a novel road crack detection approach based on a random structured forest algorithm called CrackForest. They first utilized integral channel features to redefine crack characteristics, aiming to better represent cracks with intensity inhomogeneity, and then employed a random structured forest to build a high-performing crack detector capable of identifying cracks of arbitrary complexity. Compared with other crack detection methods, their approach proved more efficient and simpler to parallelize. Guo et al. [20] proposed the Crack Transformer (CT), a crack segmentation model that combines a Swin Transformer encoder and decoder with multilayer perceptron layers for the automated detection of long and intricate pavement cracks. They trained on a crack image dataset containing different types of environmental disturbance, and the findings indicated that the proposed CT model can detect fine cracks well. Zhou et al. [21] utilized a DCNN for crack segmentation, proposing a new heterogeneous image fusion strategy that incorporates intensity and distance images at the pixel level, which helps mitigate uncertainties associated with individual data sources.
Deep learning-based pavement crack detection has been studied extensively [11,12,13,14,15,16,17]. These studies considered detection under various weather conditions, effectively demonstrated the feasibility of deep learning for crack detection, and achieved promising results. However, the high computational costs of these models make them difficult to deploy on edge devices, so balancing accuracy and lightweight design remains a challenge. Image segmentation [18,19,20,21] provides pixel-level classification by labeling each pixel in an image. However, semantic segmentation cannot differentiate between instances within the same category, meaning it cannot provide specific location information for individual targets, which is limiting for instance-level tasks in complex scenes.

2.2. Real-Time Detection Methods Applying Edge Computing

Zhang et al. [22] enhanced the MobileNetV2-SSDLite algorithm by incorporating a channel attention mechanism into the network, effectively accentuating defects while suppressing background noise. To address the imbalance between the numbers of defective and background candidate boxes, a focal loss function was utilized, and the model was ultimately deployed on an NVIDIA Jetson Nano for real-time fabric defect detection. Tang et al. [23] utilized an edge-cloud computing framework to devise an automated linear defect detection system for large PV plants. They developed a new deep learning-based algorithm for PV defect detection and distributed computational tasks among edge devices, edge servers, and cloud servers. Park et al. [24] used edge devices to improve the speed of data processing and analysis, established an LSTM recurrent neural network-based model for detecting faults in machinery, and loaded the model onto the edge devices to achieve real-time fault detection. To minimize computational expense, Zhang et al. [25] introduced the cross-stage lightweight (CSL) module, a novel convolution technique focused on being lightweight that replaces pointwise convolution with depthwise convolution to generate candidate features. They constructed a lightweight detection algorithm with this module, CSL-YOLO, and tested it on an edge device; compared with YOLOv4-Tiny, CSL-YOLO used only 43% of the FLOPs and 52% of the parameters. Wang et al. [26] developed an intelligent system for detecting surface defects in complex product images using the Faster R-CNN algorithm, employing a cloud-edge computing framework in conjunction with the algorithm to ensure speedy defect detection.

3. Crack Detection Method Based on TF-MobileNet

3.1. Network Structure

In this study, we considered that large-scale crack detection networks are difficult to deploy and slow to run in practical engineering applications. We therefore designed a lightweight crack feature extraction network that combines a MobileNetV3 [27] backbone with the Bottleneck Transformer [28]. The aim was to satisfy crack detection accuracy requirements while remaining easily deployable to edge devices for real-time crack detection.
MobileNetV3 is a renowned lightweight architecture that combines the depthwise separable convolutions of MobileNetV1 [29] with the inverted residuals, linear bottleneck, and SE (Squeeze-and-Excitation) modules of MobileNetV2 [30]. Compared to the previous two versions, MobileNetV3 not only improves precision but also significantly decreases inference time. It uses group convolution to reduce computational complexity, with the number of groups equal to the number of channels (i.e., depthwise convolution), which minimizes computation; a 1 × 1 convolution then merges the channels. MobileNetV3 also adopts the bottleneck structure and the SE attention module, which automatically derives the significance of individual feature channels, promoting useful features and suppressing less useful ones.

The intricate background and variability of pavement cracking characteristics pose a challenge, so we incorporated the Transformer structure into the backbone network to improve MobileNetV3. The Transformer's multi-head attention and self-attention enable the feature extraction network to focus more strongly on crack characteristics, raising the baseline for crack detection. For feature fusion, we used Feature Pyramid Networks (FPN) together with Path Aggregation Networks (PAN) [31], integrating rich semantic information with shallow location details to augment feature representation at different scales without letting the shallow location information disturb the deep features. We also incorporated an attention mechanism in the feature fusion phase; to keep the model lightweight, we chose the parameter-free SimAM [32] attention module, whose incorporation highlights crack features and enhances the fusion effect. Finally, we chose Complete IoU Loss (CIoU) [33] as our localization loss function and Binary Cross Entropy Loss (BCELoss) as our classification loss function.
Figure 1 depicts the detailed network architecture. The design comprises a feature extraction network based on MobileNetV3, into which the Transformer architecture and SE (Squeeze-and-Excitation) attention modules are introduced; these automatically infer the importance of each feature channel, enhancing useful features while suppressing less relevant ones, and allow the TF-MobileNet feature extraction network to extract information from images efficiently and quickly. In the head section, we combine the Feature Pyramid Network (FPN) and Path Aggregation Network (PAN) for feature fusion, which improves detection performance across various scenarios. Additionally, the C2f module from YOLOv8, the SimAM attention mechanism, and the Head-Attention module are incorporated to help TF-MobileNet generate better feature maps. The three generated feature maps have dimensions of 80 × 80 × 256, 40 × 40 × 256, and 20 × 20 × 512, respectively; the 20 × 20 × 512 feature map, for example, can be understood as a 20 × 20 grid in which each cell carries a 512-dimensional feature vector. Finally, image prediction is performed. Figure 1 clearly illustrates how these components are integrated into the entire network.

3.2. Feature Extraction Network of TF-MobileNet

Since we aim to build a lightweight algorithm for automatic crack detection, we adopt MobileNetV3 integrated with Transformer as the backbone of our network for extracting crack features. MobileNetV3 is a well-performing lightweight network that combines the advantages of the v1 and v2 versions and uses the nonlinear h-swish [27] function instead of swish [34]. The h-swish function brings faster computational speed and is more quantization-friendly. Equations (1) and (2) show the computations of h-swish and swish.
$$\mathrm{h\text{-}swish}(x) = x \cdot \frac{\mathrm{ReLU6}(x + 3)}{6} \quad (1)$$
$$f(x) = x \cdot \mathrm{sigmoid}(\beta x) \quad (2)$$
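For illustration, the two activations can be sketched in PyTorch as follows (a minimal sketch assuming a standard PyTorch environment; the test tensor and the β value are illustrative):

```python
import torch
import torch.nn.functional as F

def h_swish(x: torch.Tensor) -> torch.Tensor:
    # Equation (1): h-swish(x) = x * ReLU6(x + 3) / 6.
    # ReLU6 clips to [0, 6], giving a cheap piecewise-linear stand-in
    # for the sigmoid gate, which is friendlier to quantization.
    return x * F.relu6(x + 3.0) / 6.0

def swish(x: torch.Tensor, beta: float = 1.0) -> torch.Tensor:
    # Equation (2): f(x) = x * sigmoid(beta * x).
    return x * torch.sigmoid(beta * x)

x = torch.linspace(-6.0, 6.0, steps=7)
print(h_swish(x))   # closely tracks swish(x) without any exponentials
```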
To reduce latency while maintaining high-dimensional features, the layer preceding the average pooling layer in MobileNetV2's inverted bottleneck structure was removed, and the feature map is instead computed using convolution. Figure 2 shows the modified inverted bottleneck structure.
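The inverted bottleneck pattern itself can be sketched as below (a simplified sketch rather than the exact configuration of Figure 2: the 3 × 3 depthwise kernel is assumed, and the SE branch is omitted for brevity):

```python
import torch.nn as nn

class InvertedBottleneck(nn.Module):
    # Expand (1x1) -> depthwise (3x3) -> project (1x1, linear bottleneck).
    def __init__(self, c_in: int, c_out: int, expand: int, stride: int = 1):
        super().__init__()
        c_mid = c_in * expand
        self.use_res = stride == 1 and c_in == c_out
        self.block = nn.Sequential(
            nn.Conv2d(c_in, c_mid, 1, bias=False),
            nn.BatchNorm2d(c_mid), nn.Hardswish(),
            # depthwise: groups == channels, the lightweight core of MobileNet
            nn.Conv2d(c_mid, c_mid, 3, stride, 1, groups=c_mid, bias=False),
            nn.BatchNorm2d(c_mid), nn.Hardswish(),
            # linear bottleneck: no activation after the projection
            nn.Conv2d(c_mid, c_out, 1, bias=False),
            nn.BatchNorm2d(c_out),
        )

    def forward(self, x):
        y = self.block(x)
        return x + y if self.use_res else y   # inverted residual connection
```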
Although MobileNetV3 has the advantage of being lightweight, it does not perform particularly well in feature extraction, especially for fine, elongated features such as cracks. As the number of convolutional layers increases, the local receptive field of the feature map expands, which can be detrimental to detecting the fine details of cracks. Introducing BoTNet adds little computational effort, but its internal global attention mechanism allows the model to better combine contextual information and handle densely packed small targets more effectively.
To enhance the model's robustness, we used the Mosaic [35] data augmentation technique: four randomly chosen images are randomly scaled and then randomly stitched together. The scaling operation introduces many small targets, which not only improves the robustness of crack detection but also enriches the dataset.

3.3. Feature Fusion Network of TF-MobileNet

In the feature fusion stage, we used a fusion strategy that combines Feature Pyramid Networks (FPN) with Path Aggregation Networks (PAN). FPN adopts a top-down architecture that combines high-level semantic information with low-level detail: features from the semantically strong, low-resolution maps are propagated downward and merged with higher-resolution maps at each level, forming a pyramid structure. FPN is effective at capturing features across different scales, improving the model's performance on both small and large objects; by enriching the semantic information of lower-level feature maps, it increases detection accuracy.
PANet (Path Aggregation Network) builds on FPN by adding a bottom-up path aggregation module. This module combines bottom-up feature maps with top-down feature maps, further strengthening the feature information. PANet not only improves the fusion of multi-scale features but also introduces global contextual information, enhancing object segmentation capabilities. It is particularly well-suited for object detection and segmentation tasks in complex scenarios.
In FPN, because the feature maps pass from large to small and then from small to large, shallow feature information is significantly diluted after multiple layers. Therefore, a bottom-up path aggregation network (PAN) is appended after the FPN, and the i-th feature map of the PAN fuses the (i + 1)-th feature map of the FPN to produce a new fused feature map, which enhances the characterization capability of the features. Figure 3 illustrates the detailed fusion process, and a minimal sketch follows.
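The sketch below illustrates the fusion rule (a minimal illustration assuming the three levels already share a channel count; the real network interleaves convolution and C2f blocks between the fusion steps):

```python
import torch
import torch.nn.functional as F

def fpn_pan_fuse(c3, c4, c5):
    # Top-down FPN path: upsample deeper maps, merge with shallower ones.
    p5 = c5
    p4 = c4 + F.interpolate(p5, scale_factor=2, mode="nearest")
    p3 = c3 + F.interpolate(p4, scale_factor=2, mode="nearest")
    # Bottom-up PAN path: downsample shallow maps and merge upward, so
    # each PAN level also absorbs the corresponding FPN level's features.
    n3 = p3
    n4 = p4 + F.max_pool2d(n3, kernel_size=2)
    n5 = p5 + F.max_pool2d(n4, kernel_size=2)
    return n3, n4, n5

# toy inputs: strides 8/16/32 of a 640 x 640 image, equal channel counts
c3 = torch.randn(1, 256, 80, 80)
c4 = torch.randn(1, 256, 40, 40)
c5 = torch.randn(1, 256, 20, 20)
n3, n4, n5 = fpn_pan_fuse(c3, c4, c5)
```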
This study combines FPN and PANet to leverage the strengths of both. FPN provides rich multi-scale features for detection, while PANet further enhances the feature representation, improving the model’s accuracy in object localization and classification. This fusion contributes to better detection performance across various scenarios, particularly in cases with complex backgrounds and the presence of multi-scale objects.
Next, we inserted the parameter-free attention mechanism SimAM into the Neck network; ablation experiments confirmed that this module enhances feature fusion. We also added a Transformer structure to the Neck to upgrade the PANet, and the inclusion of multi-head self-attention further enhanced the feature fusion effect. Finally, the fused feature maps are passed to the prediction side, which generates many candidate anchor boxes. We applied non-maximum suppression (NMS) [36] to remove superfluous prediction boxes: with the threshold set to 0.5, NMS keeps the prediction box with the highest score and removes boxes that overlap it beyond the threshold. The overlap is measured by IoU, as shown in Figure 4.
In Figure 4, the ground-truth box is shown in yellow and the prediction box in green. If the ratio of the two boxes' intersection area to their union area exceeds the designated threshold, the prediction box is considered true and valid; each IoU value can be calculated with Equation (3).
$$\mathrm{IoU} = \frac{|S_g \cap S_p|}{|S_g \cup S_p|} \quad (3)$$
There are various refinements of non-maximum suppression (NMS). This study tested the NMS technique described in YOLO during crack image prediction and found CIoU (Complete Intersection over Union) to be more effective: it accounts for the distance between the object and the anchor box, the overlap ratio, the scale, and a penalty term, making bounding box regression more stable and preventing failures during training. Therefore, this study utilized CIoU for localization loss and BCELoss (Binary Cross Entropy Loss) for classification loss estimation.
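A compact sketch of the IoU computation in Equation (3) and of greedy NMS with the 0.5 threshold is given below (boxes are assumed to be in (x1, y1, x2, y2) format; production implementations are vectorized):

```python
import torch

def iou(a, b):
    # intersection area over union area of two (x1, y1, x2, y2) boxes
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def nms(boxes, scores, thresh=0.5):
    order = scores.argsort(descending=True).tolist()
    keep = []
    while order:
        best = order.pop(0)          # highest remaining score is kept
        keep.append(best)
        # drop remaining boxes that overlap the kept box beyond the threshold
        order = [i for i in order if iou(boxes[best], boxes[i]) <= thresh]
    return keep

boxes = torch.tensor([[0, 0, 10, 10], [1, 1, 10, 10], [20, 20, 30, 30.]])
scores = torch.tensor([0.9, 0.8, 0.7])
print(nms(boxes, scores))  # -> [0, 2]: box 1 overlaps box 0 and is suppressed
```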

3.4. Attention Mechanism Module

The attention mechanism is a special structure that can be embedded directly in machine learning models. It is widely used in natural language processing (NLP) and image processing to help models select effective, relevant features and discard those that do not contribute to training, so that models complete their tasks more efficiently.
The Squeeze-and-Excitation Networks (SENet) [37] attention module adaptively recalibrates feature channels according to their importance. Through this mechanism, the model learns to leverage global information to selectively accentuate informative features and downplay unimportant ones. SENet is a lightweight mechanism that uses channel attention to enhance the representational power of the basic modules throughout the network without introducing many additional parameters; our ablation results likewise showed that SENet did not introduce a redundant number of floating-point operations. The working mechanism of SENet is shown in Figure 5.
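A minimal PyTorch sketch of the squeeze-and-excitation operation in Figure 5 follows (the reduction ratio of 16 is the common default and an assumption here, not a value taken from this paper):

```python
import torch.nn as nn

class SEBlock(nn.Module):
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)   # squeeze: global spatial context
        self.fc = nn.Sequential(              # excitation: per-channel weights
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x):
        b, c, _, _ = x.shape
        w = self.fc(self.pool(x).view(b, c)).view(b, c, 1, 1)
        return x * w   # reweight channels: promote informative features
```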
SimAM is a conceptually straightforward yet remarkably efficient attention module that infers 3D attention weights for the feature maps of a layer without adding any parameters to the original network, making it plug-and-play. In this study, we inserted this module in the feature fusion phase, and the experimental results showed that it facilitates the model's feature fusion. The structure of SimAM is shown in Figure 6.
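A sketch of the parameter-free SimAM weighting, following the energy formulation of [32] (the regularization constant λ = 1e-4 is a commonly used default and is assumed here):

```python
import torch

def simam(x: torch.Tensor, e_lambda: float = 1e-4) -> torch.Tensor:
    # x: (B, C, H, W). No learnable parameters: the 3D attention weights
    # are derived from an energy function over each channel's activations.
    b, c, h, w = x.shape
    n = h * w - 1
    d = (x - x.mean(dim=(2, 3), keepdim=True)).pow(2)   # squared deviation
    v = d.sum(dim=(2, 3), keepdim=True) / n             # per-channel variance
    e_inv = d / (4 * (v + e_lambda)) + 0.5              # inverse energy
    return x * torch.sigmoid(e_inv)                     # plug-and-play reweight

y = simam(torch.randn(1, 64, 40, 40))
```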

3.5. Loss Function

The loss function quantifies the discrepancy between predicted and actual values and therefore strongly influences the model's performance. The total loss of our model is the sum of three components: localization loss, confidence loss, and classification loss. Considering the consistency of the bounding box aspect ratio, we chose CIoU as our localization loss function, defined as follows:
$$L_{\mathrm{CIoU}} = 1 - \mathrm{IoU} + \frac{\rho^2(b, b^{gt})}{c^2} + \alpha\nu$$
where $\alpha$ denotes the positive trade-off parameter, defined as
$$\alpha = \frac{\nu}{(1 - \mathrm{IoU}) + \nu}$$
and $\nu$ measures the consistency of the aspect ratio, defined as
$$\nu = \frac{4}{\pi^2}\left(\arctan\frac{\omega^{gt}}{h^{gt}} - \arctan\frac{\omega}{h}\right)^2$$
The term $\rho^2(b, b^{gt})/c^2$ penalizes the distance between box centers: the numerator $\rho^2(b, b^{gt})$ is the squared Euclidean distance between the center points of the ground-truth and predicted boxes, and $c$ is the diagonal length of the smallest bounding box that encloses both.
In crack detection there are only two classes, crack and non-crack, so we chose BCELoss, which allows the model to converge quickly. BCELoss is calculated as:
$$\mathrm{Loss} = -\frac{1}{n}\sum_{i=1}^{n}\left[\, y_i \cdot \log p(y_i) + (1 - y_i) \cdot \log\bigl(1 - p(y_i)\bigr) \right]$$
where $y_i$ denotes whether the label is crack or not, and $p(y_i)$ denotes the predicted probability corresponding to the true label $y_i$; the closer the prediction is to the actual label, the smaller the loss.
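For a single box pair, the two loss terms can be sketched as follows (a sketch mirroring the equations above, assuming axis-aligned (x1, y1, x2, y2) boxes; it is not the exact training code):

```python
import math
import torch

def ciou_loss(p: torch.Tensor, g: torch.Tensor, eps: float = 1e-9):
    # IoU term
    ix = (torch.min(p[2], g[2]) - torch.max(p[0], g[0])).clamp(min=0)
    iy = (torch.min(p[3], g[3]) - torch.max(p[1], g[1])).clamp(min=0)
    inter = ix * iy
    union = (p[2]-p[0])*(p[3]-p[1]) + (g[2]-g[0])*(g[3]-g[1]) - inter
    iou = inter / (union + eps)
    # center-distance term rho^2 / c^2, normalized by the enclosing diagonal
    pcx, pcy = (p[0] + p[2]) / 2, (p[1] + p[3]) / 2
    gcx, gcy = (g[0] + g[2]) / 2, (g[1] + g[3]) / 2
    rho2 = (pcx - gcx) ** 2 + (pcy - gcy) ** 2
    cw = torch.max(p[2], g[2]) - torch.min(p[0], g[0])
    ch = torch.max(p[3], g[3]) - torch.min(p[1], g[1])
    c2 = cw ** 2 + ch ** 2 + eps
    # aspect-ratio consistency term alpha * v
    v = (4 / math.pi ** 2) * (torch.atan((g[2]-g[0]) / (g[3]-g[1]))
                              - torch.atan((p[2]-p[0]) / (p[3]-p[1]))) ** 2
    alpha = v / ((1 - iou) + v + eps)
    return 1 - iou + rho2 / c2 + alpha * v

pred = torch.tensor([1.0, 1.0, 5.0, 4.0])
gt = torch.tensor([1.0, 2.0, 5.0, 5.0])
print(ciou_loss(pred, gt))
bce = torch.nn.BCELoss()   # classification loss over crack / non-crack
```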

4. Experiment and Analysis

4.1. Experimental Platform

Training a deep convolutional neural network requires a substantial quantity of image data, so we trained on a supercomputer and a personal server in parallel to reduce training time. Tables 1 and 2 list the main configurations of the supercomputer and the personal server. PyTorch was chosen as our deep learning framework, with CUDA 11.1 for GPU acceleration.

4.2. Experimental Data

In this study, a total of 3433 crack images were collected for the training, validation, and testing phases, and simple denoising was applied to them. The dataset was divided into training and validation sets at a ratio of approximately 10:1; accordingly, 400 images were randomly selected for validation using a Python script.
Many methods exist to enhance a model's robustness and prevent overfitting, such as image sharpening, image thresholding, and feature extraction. In this study, we applied random flip data augmentation; Figure 7 illustrates the method.
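A sketch of the random split and flip augmentation described above (the directory layout, file extension, and seed are illustrative assumptions):

```python
import random
from pathlib import Path
from PIL import Image

random.seed(0)
images = sorted(Path("crack_dataset/images").glob("*.jpg"))  # hypothetical path
val = set(random.sample(images, 400))        # 400 random validation images
train = [p for p in images if p not in val]  # the remainder for training

def random_flip(img: Image.Image) -> Image.Image:
    # 180-degree flip as in Figure 7b, applied with probability 0.5;
    # the corresponding bounding-box labels must be flipped the same way.
    return img.rotate(180) if random.random() < 0.5 else img
```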

4.3. Training Parameter Setting

Choosing appropriate hyperparameters during model training is crucial for optimizing performance. The total number of epochs typically ranges from dozens to hundreds, depending on the dataset's size and complexity. To prevent overfitting, we employed an early stopping method, monitoring the validation loss and halting training when no further improvement was observed. For batch size, common values such as 32 or 64 help balance memory constraints and training efficiency. The initial learning rate is usually set to 10⁻³ or 10⁻⁴ and combined with a learning rate scheduling strategy, allowing dynamic adjustments to improve model performance. Weight decay is generally set to 10⁻⁴ or 5 × 10⁻⁴ to prevent overfitting and enhance the model's generalization ability. Lastly, the maximum learning rate is typically set to two to three times the initial learning rate, while the minimum learning rate is typically 10% to 1% of the initial value, ensuring fine adjustments during the later stages of training. Through careful selection and tuning of these hyperparameters, we aimed to enhance the model's overall performance and robustness.
In this study, we set the training epochs to 300 and the batch size to 8. The initial learning rate was 0.001 with a weight decay of 0; the maximum learning rate was 0.01 and the minimum learning rate 0.0001, and we used the Adam optimization strategy. Because the initial learning rate was small, we combined it with a cosine annealing learning rate decay strategy: training proceeds at a stable rate until the network parameters stabilize, and as the model approaches the optimal solution the learning rate is gradually reduced, so that the training result converges closely toward the optimum.
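These settings translate to PyTorch roughly as follows (a sketch; `model` and `train_loader` stand in for TF-MobileNet and its data loader, and the loss composition is simplified):

```python
import torch

optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=0.0)
# cosine annealing from the initial rate down toward the minimum rate
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(
    optimizer, T_max=300, eta_min=1e-4)

for epoch in range(300):
    for images, targets in train_loader:   # hypothetical data loader
        optimizer.zero_grad()
        loss = model(images, targets)      # total = box + confidence + class
        loss.backward()
        optimizer.step()
    scheduler.step()                       # decay the learning rate per epoch
```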

4.4. Model Evaluation

We adopted the evaluation criteria standard in computer vision, mainly precision, recall, and mAP. Precision reflects the proportion of samples predicted as cracks that are truly cracks; to a certain extent, the higher the precision, the better the model performs the task. Recall represents the proportion of all samples labeled as 'crack' that the model successfully predicts as cracks. Precision and recall are defined in Equations (7) and (8):
$$P = \frac{TP}{TP + FP} \quad (7)$$
$$R = \frac{TP}{TP + FN} \quad (8)$$
We use Figure 8, the confusion matrix, to explain the meaning of each indicator (TP, FP, FN) in the two equations.
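In code, the two metrics reduce to simple ratios over the confusion-matrix counts (the counts below are illustrative, chosen so the output reproduces the precision and recall our model reports):

```python
def precision(tp: int, fp: int) -> float:
    return tp / (tp + fp)   # Equation (7): correctness of predicted cracks

def recall(tp: int, fn: int) -> float:
    return tp / (tp + fn)   # Equation (8): coverage of actual cracks

# e.g., 874 true positives, 126 false positives, 178 false negatives
print(precision(874, 126), recall(874, 178))  # -> 0.874  0.8308...
```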

4.5. Ablation Experiments

To balance the model's evaluation metrics against detection speed and parameter count, we conducted a sequence of trials, both to prove the efficacy of the individual modules and to account for the lightweight design and high detection accuracy required for deployment on edge devices. We validated the major modules of the model, including the Transformer, the attention modules, and the modified PANet, testing these three components separately and then in pairwise combinations. The findings are outlined in Table 3. Here, P stands for precision, R for recall, and mAP for mean average precision; Para. refers to the number of model parameters, and GFlop indicates the computational cost in billions of floating-point operations. These abbreviations also apply in Table 4.
The ablation experiments show that the lightweight MobileNetV3 network alone did not meet the accuracy requirements for crack detection, but combining it with the Transformer for crack feature extraction significantly improved performance in all respects. Improving PANet with BoTNet also strengthened feature fusion. The two attention mechanisms further enhanced the network's focus on crack characteristics; although SENet introduced some additional parameters, the resulting improvement in detection performance made this overhead worthwhile.

4.6. Comparison Experiments

The above experiments only show the effectiveness of each module within our model; they cannot prove that it outperforms other models. Therefore, we conducted several comparison experiments against models that are currently well recognized in the field of object detection, including YOLOv3 [38], YOLOv4 [35], YOLOv5s, YOLOv7 [39], and SSD. Because most of these are YOLO-series algorithms, we also compared against a newer algorithm, the Real-Time DEtection TRansformer (RT-DETR) [40]. All of these detectors are end-to-end one-stage models capable of fast response in crack detection, and many researchers have innovated on them and applied them to various inspection fields. In Table 4, Weight represents the size of the trained weight file in kilobytes (KB).
As the table shows, our proposed model balances crack detection accuracy against overall algorithmic performance. It significantly outperforms the baseline YOLO models in detection accuracy and parameter count, and its floating-point operation count, a measure of model complexity, is also much lower. While RT-DETR boasts high detection accuracy, its large parameter count makes it challenging to deploy on edge devices. TF-MobileNet therefore strikes a balance between detection accuracy and lightweight design, remaining simple and compact while meeting the accuracy requirements of automatic crack detection.

5. Edge Deployment

To meet the requirements of automatic inspection in real engineering, we deployed our model on an edge device; Figure 9 shows the detection process. The edge device we chose is the NVIDIA Jetson Nano. Since the model could be used for inspection with vehicle-mounted devices or with drones for overhead inspection, we validated it with both crack images and videos; Figure 10 shows image detection.
Simultaneously, we tested crack images from other datasets, as shown in Figure 11. Our model continued to demonstrate good detection results when applied to crack images from these additional datasets.
We also examined 12-s and 14-s videos captured under different weather conditions, as shown in Figure 12 and Figure 13. One video was recorded on a sunny day with favorable environmental conditions, while the other was taken under cloudy conditions, making it less clear compared to the sunny environment.
We also compared our model with the YOLO-series object detection algorithms when deployed on the edge device, covering both image and video detection. Table 5 reports the per-image latency and the video detection frame rate.
As shown in Table 6, we compared the efficiency of the model in detecting images across different devices, including CPU, GPU-a10, GPU-a30, RTX3080, and Jetson Nano. This comparison was conducted to evaluate the performance of our model on various hardware platforms. Although the results may vary slightly due to differences in background processes running on these devices, they still demonstrate the lightweight nature of our model. Given the small number of parameters in our model, the detection time on these devices is very fast and does not consume excessive resources.
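Per-image latency figures such as those in Tables 5 and 6 can be obtained with a simple timing loop (a sketch; the warm-up runs avoid counting one-time initialization, and the placeholder `model` and the 640 × 640 input size are assumptions):

```python
import time
import torch

model = torch.nn.Identity()                # stand-in for the loaded detector
model.eval()
x = torch.randn(1, 3, 640, 640)            # one 640 x 640 input image
with torch.no_grad():
    for _ in range(10):                    # warm-up runs
        model(x)
    if torch.cuda.is_available():
        torch.cuda.synchronize()           # flush pending GPU work
    t0 = time.perf_counter()
    for _ in range(100):
        model(x)
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    print((time.perf_counter() - t0) / 100 * 1000, "ms per image")
```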

6. Conclusions

In our study, we offered a lightweight automatic crack detection algorithm based on computer vision and edge computing. We built a lightweight crack feature extraction network from MobileNetV3 and a Transformer; adding the Transformer markedly enhanced crack feature extraction, and incorporating a channel attention mechanism enabled the network to prioritize crack features. We then used the Transformer to improve the Path Aggregation Network, which further improved the feature fusion effect. Compared to YOLOv5, YOLOv7, and YOLO_CA, our model's accuracy differs by no more than 2%, while its parameter count is less than 50% of these models'. Our model balances detection accuracy and lightweight design, making it simple and portable while meeting the accuracy requirements for automatic crack detection.
We deployed our proposed detection model on an edge device and tested it. We used pictures and videos from two different environmental situations for validation, and the test results showed that the inference time for a single picture was only 11.3 ms, and video detection achieved a speed of 83 frames per second, meeting the detection requirements in practical engineering applications. However, we found that during video detection, some small cracks could not be detected if the camera moved too quickly. We suspect the issue was a mismatch between the model's response speed and the camera's movement speed.
Although the model has demonstrated its effectiveness overall, certain potential limitations, beyond the missed small cracks under rapid camera movement, must be acknowledged. Firstly, changes in lighting conditions can significantly impact model performance; in particular, under extreme lighting environments such as strong light or low light, the recognizability of features may suffer. Future research should therefore consider expanding the dataset under various lighting conditions to enhance the model's robustness.
Secondly, data diversity is key to improving the model’s generalization capability. Future work needs to include samples from different scenes, angles, and backgrounds. Additionally, the model may face challenges when handling overlapping or similarly textured objects, highlighting the need for further technical improvements to boost overall performance. By addressing these limitations, we expect future studies to better handle the complex situations encountered in real-world applications.
Deploying our model on edge devices allows for real-time monitoring of pavement, significantly reducing maintenance budgets and safety hazards while extending the road’s lifespan. However, cracks are just one type of pavement distress. In the future, it will be necessary to incorporate more types of distress data to enhance the model’s applicability.

Author Contributions

In this study, J.Y. built the object detection network model, completed the experiments, and wrote the article. J.Y. recorded the experimental data. C.C. completed the search of the dataset. J.Y. finished revising the grammar of the article. S.Q. provided various guidance. All authors have read and agreed to the published version of the manuscript.

Funding

The authors received no specific funding for this study.

Data Availability Statement

The data that support the findings of this study are available from the corresponding author, S.Q., upon reasonable request.

Acknowledgments

We thank the Supercomputing Center of the State Key Laboratory of Public Big Data of Guizhou University for providing the experimental platform for this research, along with Jian Zhang for providing data support for this experiment.

Conflicts of Interest

The authors declare that they have no conflicts of interest to report regarding the present study.

References

  1. Mohan, A.; Poobal, S. Crack detection using image processing: A critical review and analysis. Alex. Eng. J. 2018, 57, 787–798. [Google Scholar] [CrossRef]
  2. Zaloshnja, E.; Miller, T.R. Cost of crashes related to road conditions, United States, 2006. Ann. Adv. Automot. Med. 2009, 53, 141–153. [Google Scholar] [PubMed]
  3. Nhat-Duc, H.; Nguyen, Q.L.; Tran, V.D. Automatic recognition of asphalt pavement cracks using metaheuristic optimized edge detection algorithms and convolution neural network. Autom. Constr. 2018, 94, 203–213. [Google Scholar] [CrossRef]
  4. Nhat-Duc, H.; Nguyen, Q.L.; Tran, V.D. Wavelet-morphology based detection of incipient linear cracks in asphalt pavements from RGB camera imagery and classification using circular Radon transform. Adv. Eng. Inform. 2016, 30, 481–499. [Google Scholar]
  5. Gkantou, M.; Muradov, M.; Kamaris, G.S.; Hashim, K.; Atherton, W.; Kot, P. Novel Electromagnetic Sensors Embedded in Reinforced Concrete Beams for Crack Detection. Sensors 2019, 19, 5175. [Google Scholar] [CrossRef] [PubMed]
  6. Rodríguez-Martin, M.; Lagüela, S.; González-Aguilera, D.; Arias, P. Cooling analysis of welded materials for crack detection using infrared thermography. Infrared Phys. Technol. 2014, 67, 547–554. [Google Scholar] [CrossRef]
  7. Hosseini, Z.; Momayez, M.; Hassani, F.; Lévesque, D. Detection of inclined cracks inside concrete structures by ultrasonic SAFT. AIP Conf. Proc. 2008, 975, 1298–1304. [Google Scholar]
  8. Koshti, A. X-ray ray tracing simulation and flaw parameters for crack detection. In Health Monitoring of Structural and Biological Systems XII; SPIE: Bellingham, WA, USA, 2018. [Google Scholar]
  9. Huang, J.; Liu, W.; Sun, X. A Pavement Crack Detection Method Combining 2D with 3D Information Based on Dempster-Shafer Theory. Comput.-Aided Civ. Infrastruct. Eng. 2014, 29, 299–313. [Google Scholar] [CrossRef]
  10. Kheradmandi, N.; Mehranfar, V. A critical review and comparative study on image segmentation-based techniques for pavement crack detection. Constr. Build. Mater. 2022, 321, 126162. [Google Scholar] [CrossRef]
  11. Han, W.; Wang, Q. Pavement Crack Detection Based on YOLOv3. In Proceedings of the 2019 2nd International Conference on Safety Produce Informatization (IICSPI), Chongqing, China, 28–30 November 2019. [Google Scholar]
  12. Hacıefendioğlu, K.; Başağa, H.B. Concrete Road Crack Detection Using Deep Learning-Based Faster R-CNN Method. Iran. J. Sci. Technol. Trans. Civ. Eng. 2022, 46, 1621–1633. [Google Scholar]
  13. Yu, Z.; Shen, Y.; Shen, C. A real-time detection approach for bridge cracks based on YOLOv4-FPM. Autom. Constr. 2021, 122, 103514. [Google Scholar] [CrossRef]
  14. Zhang, Y.; Zuo, Z.; Xu, X.; Wu, J.; Zhu, J.; Zhang, H.; Wang, J.; Tian, Y. Road damage detection using UAV images based on multi-level attention mechanism. Autom. Constr. 2022, 144, 104613. [Google Scholar] [CrossRef]
  15. Tan, Y.; Cai, R.; Li, J.; Chen, P.; Wang, M. Automatic detection of sewer defects based on improved you only look once algorithm. Autom. Constr. 2021, 131, 103912. [Google Scholar] [CrossRef]
  16. Alshawabkeh, S.; Wu, L.; Dong, D.; Cheng, Y.; Li, L.; Alanaqreh, M. Automated Pavement Crack Detection Using Deep Feature Selection and Whale Optimization Algorithm. Comput. Mater. Contin. 2023, 77, 63–77. [Google Scholar] [CrossRef]
  17. Chen, S.; Feng, Z.; Xiao, G.; Chen, X.; Gao, C.; Zhao, M.; Yu, H. Pavement Crack Detection Based on the Improved Swin-Unet Model. Buildings 2024, 14, 1442. [Google Scholar] [CrossRef]
  18. Liu, Z.; Cao, Y.; Wang, Y.; Wang, W. Computer vision-based concrete crack detection using U-net fully convolutional networks. Autom. Constr. 2019, 104, 129–139. [Google Scholar] [CrossRef]
  19. Shi, Y.; Cui, L.; Qi, Z.; Meng, F.; Chen, Z. Automatic Road Crack Detection Using Random Structured Forests. IEEE Trans. Intell. Transp. Syst. 2016, 17, 3434–3445. [Google Scholar] [CrossRef]
  20. Guo, F.; Qian, Y.; Liu, J.; Yu, H. Pavement crack detection based on transformer network. Autom. Constr. 2023, 145, 104646. [Google Scholar] [CrossRef]
  21. Zhou, S.; Song, W. Crack segmentation through deep convolutional neural networks and heterogeneous image fusion. Autom. Constr. 2021, 125, 103605. [Google Scholar] [CrossRef]
  22. Zhang, J.; Jing, J.; Lu, P.; Song, S. Improved MobileNetV2-SSDLite for automatic fabric defect detection system based on cloud-edge computing. Measurement 2022, 201, 111665. [Google Scholar] [CrossRef]
  23. Tang, W.; Yang, Q.; Hu, X.; Yan, W. Deep learning-based linear defects detection system for large-scale photovoltaic plants based on an edge-cloud computing. Sol. Energy 2022, 231, 527–535. [Google Scholar] [CrossRef]
  24. Park, D.; Kim, S.; An, Y.; Jung, J.Y. LiReD: A light-weight real-time fault detection system for edge computing using LSTM recurrent neural networks. Sensors 2018, 18, 2110. [Google Scholar] [CrossRef]
  25. Zhang, Y.M.; Lee, C.C.; Hsieh, J.W.; Fan, K.C. CSL-YOLO: A new lightweight object detection system for edge computing. arXiv 2021, arXiv:2107.04829. [Google Scholar]
  26. Wang, Y.; Liu, M.; Zheng, P.; Yang, H.; Zou, J. A smart surface inspection system using faster R-CNN in cloud-edge computing environment. Adv. Eng. Inform. 2020, 43, 101037. [Google Scholar] [CrossRef]
  27. Howard, A.; Sandler, M.; Chu, G.; Chen, L.C.; Chen, B.; Tan, M.; Wang, W.; Zhu, Y.; Pang, R.; Vasudevan, V.; et al. Searching for MobileNetV3. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27–28 October 2019. [Google Scholar]
  28. Srinivas, A.; Lin, T.Y.; Parmar, N.; Shlens, J.; Abbeel, P.; Vaswani, A. Bottleneck transformers for visual recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Virtual, 19–25 June 2021. [Google Scholar]
  29. Howard, A.G.; Zhu, M.; Chen, B.; Kalenichenko, D.; Wang, W.; Weyand, T.; Andreetto, M.; Adam, H. Mobilenets: Efficient convolutional neural networks for mobile vision applications. arXiv 2017, arXiv:1704.04861. [Google Scholar]
  30. Sandler, M.; Howard, A.; Zhu, M.; Zhmoginov, A.; Chen, L.C. Mobilenetv2: Inverted residuals and linear bottlenecks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018. [Google Scholar]
  31. Trinh, H.C.; Le, D.H.; Kwon, Y.K. PANET: A GPU-Based Tool for Fast Parallel Analysis of Robustness Dynamics and Feed-Forward/Feedback Loop Structures in Large-Scale Biological Networks. PLoS ONE 2014, 9, e103010. [Google Scholar] [CrossRef] [PubMed]
  32. Yang, L.; Zhang, R.Y.; Li, L.; Xie, X. SimAM: A Simple, Parameter-Free Attention Module for Convolutional Neural Networks. In Proceedings of the International Conference on Machine Learning, Virtual, 18–24 July 2021. [Google Scholar]
  33. Zheng, Z.; Wang, P.; Liu, W.; Li, J.; Ye, R.; Ren, D. Distance-IoU loss: Faster and better learning for bounding box regression. In Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA, 7–12 February 2020. [Google Scholar]
  34. Ramachandran, P.; Zoph, B.; Le, Q.V. Searching for activation functions. arXiv 2017, arXiv:1710.05941. [Google Scholar]
  35. Bochkovskiy, A.; Wang, C.Y.; Liao, H.Y.M. YOLOv4: Optimal Speed and Accuracy of Object Detection. arXiv 2020, arXiv:2004.10934. [Google Scholar]
  36. Bodla, N.; Singh, B.; Chellappa, R.; Davis, L.S. Soft-NMS: Improving Object Detection with One Line of Code. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017. [Google Scholar]
  37. Hu, J.; Shen, L.; Sun, G. Squeeze-and-Excitation Networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018. [Google Scholar]
  38. Redmon, J.; Farhadi, A. YOLOv3: An Incremental Improvement. arXiv 2018, arXiv:1804.02767. [Google Scholar]
  39. Wang, C.Y.; Bochkovskiy, A.; Liao, H.Y.M. YOLOv7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors. arXiv 2022, arXiv:2207.02696v1. [Google Scholar]
  40. Zhao, Y.; Lv, W.; Xu, S.; Wei, J.; Wang, G.; Dang, Q.; Liu, Y.; Chen, J. DETRs Beat YOLOs on Real-time Object Detection. arXiv 2024, arXiv:2304.08069v3. [Google Scholar]
Figure 1. The architecture of TF-MobileNet.
Figure 2. The network structure of Bneck.
Figure 3. FPN and PAN fusion process in PANet.
Figure 4. Example of IoU calculation.
Figure 5. SENet.
Figure 6. The structure of SimAM.
Figure 7. Data enhancement. (a) Indicates the original image; (b) indicates the image after applying a flip of 180 degrees.
Figure 8. Confusion matrix.
Figure 9. Crack automatic detection.
Figure 10. Edge device picture detection.
Figure 11. Other data sources were used to detect images.
Figure 12. Crack detection in sunny weather.
Figure 13. Crack detection in post-rain environment.
Table 1. Configuration of server.

Models              Data
Operating System    Red Hat 4.8.5-28
Processor           Intel Xeon Silver 4314
GPU                 NVIDIA RTX A100
Table 2. Configuration of personal computing.

Models              Data
Operating System    Windows 10
Processor           Intel® Core™ i9-10900K
GPU                 RTX 3080, 64 GB
Table 3. Results of ablation experiments.

Number  Models             P      R      mAP    Para.    GFlop
A       M3 + PANet         0.752  0.763  0.809  1630732  2.8
B       A + TF (backbone)  0.774  0.8    0.842  2310220  3.3
C       B + SE (backbone)  0.816  0.814  0.853  2834508  3.7
D       C + TF (PANet)     0.828  0.818  0.862  3099084  4.0
E       D + SimAM (Neck)   0.874  0.831  0.908  3122326  4.0

where M3 denotes MobileNetV3. TF (backbone) denotes the incorporation of the Transformer into the feature extraction network. TF (PANet) denotes the fusion of the Transformer to improve the feature fusion network.
Table 4. Comparison experimental results.

Number  Models   P      R      mAP    Para.     GFlop  Weight
A       YOLOv3   0.69   0.738  0.792  8666692   12.9   17035
B       YOLOv4   0.802  0.81   0.82   16056606  16.5   47435
C       YOLOv5s  0.838  0.876  0.893  7012822   15.8   14115
D       YOLOv7   0.833  0.85   0.914  6014988   13.2   12001
E       SSD      0.784  0.723  0.736  11136374  139    2782
F       RT-DETR  0.903  0.842  0.921  19873044  56.9   39518
G       Ours     0.874  0.831  0.908  3122326   4.0    6515
Table 5. Edge device test comparison.

Models   Picture (ms)  Video (FPS)
YOLOv3   31.8          52
YOLOv4   27.9          46
YOLOv5s  35.3          42
YOLOv7   23.9          64
Ours     11.3          89
Table 6. Performance comparison of TF-MobileNet on different device ends.

Device       Picture (ms)
CPU          13.3
GPU-a10      13.1
GPU-a30      19.3
RTX3080      16.1
Jetson Nano  11.3

