1. Introduction
Cracks serve as early indicators of structural damage in buildings, bridges, and roads, making their detection vital for structural health monitoring. Analyzing the morphological characteristics, positional information, and extent of internal damage in cracks allows for accurate safety assessments of buildings and infrastructure [1,2]. Timely detection and repair of cracks not only reduces maintenance costs but also prevents further structural deterioration, ensuring safety and durability [3,4].
Traditional crack detection methods such as visual inspections and manual evaluations are often costly and inefficient, relying heavily on the expertise of inspectors, which can lead to subjective and inconsistent results [5]. Therefore, the development of objective and efficient automated crack detection methods has become a significant trend in this field. Various sensor-based methods for automatic or semi-automatic crack detection have been proposed, including crack meters, RGB-D sensors, and laser scanners [6,7,8]. Although these sensors are accurate, they are expensive and challenging to deploy on large scales.
Advancements in computer vision technology have popularized image-based crack detection methods due to their long-distance, non-contact, and cost-effective nature. Traditional visual detection methods such as morphological image processing [9,10], filtering [11,12], and percolation models [13] are simple to implement and computationally light, but suffer from limited generalization performance. Environmental noise such as debris around cracks further complicates detection in practical engineering environments.
Recently, deep learning-based semantic segmentation algorithms have significantly improved the accuracy and stability of image recognition in noisy environments. These algorithms excel at locating and labeling crack pixels, providing comprehensive information on crack distribution, width, length, and shape [14]. Nevertheless, general scene-understanding models often fail to capture the unique features of cracks, which are typically thin, long, and irregularly shaped [15,16]. Cracks typically span entire images while constituting only a small fraction of the pixel area, necessitating models that can capture long-range dependencies between pixels. While self-attention mechanisms excel at aggregating long-distance contextual information [17,18,19], they come with high computational costs that limit detection speed. Additionally, cracks exhibit uneven distribution along with significant size variations, necessitating multiscale feature extraction [20,21,22]. Although methods such as DeepLabV3+ [23] and SegNext [24] are able to capture multiscale information, they are computationally intensive and costly for large images.
The use of unmanned aerial vehicles (UAVs) for crack monitoring has become prevalent due to their flexibility, cost effectiveness, and ability to efficiently cover both large and difficult-to-access areas [25]. However, the computational resources available on edge devices are typically limited and often lack high-power GPUs, making it crucial to deploy lightweight models that can perform processing and analysis with low latency [26]. Researchers have proposed lightweight networks that reduce computational costs by minimizing deep downsampling, reducing channel numbers, and optimizing convolutional design; however, reducing the downsampling stages can leave models with receptive fields too small to cover large objects, as seen with ENet [27]. Bilateral backbone models partially address this issue; for instance, BiSeNet [28] adds a context path with fewer channels, while HrSegNet [29] maintains high-resolution features while adjusting parameters to reduce channels. Unfortunately, these two-branch structures increase computational overhead, and reducing the channels can hinder the model's ability to learn relational features.
Furthermore, several challenges affect the design of lightweight models for surface crack segmentation: (1) existing methods increase computational complexity by incorporating large kernel convolutions, multiple parallel branches, and feature pyramid structures to handle various object sizes and shapes; (2) diverse crack image scenes and complex backgrounds limit feature extraction by lightweight models, making it difficult to learn effective information from limited datasets; and (3) the subtle differences between cracked and normal areas introduce complications during segmentation. While adding multiple skip connections and auxiliary training branches can improve accuracy, this leads to increased memory overhead.
To address the aforementioned challenges, we propose CrackScopeNet, a lightweight segmentation network optimized for structural surface cracks. It features an optimized multiscale branching architecture and a carefully designed stripwise context attention (SWA) module.
Figure 1 presents a comparison on the CrackSeg9k dataset between traditional and lightweight crack-specific semantic segmentation networks and our approach in terms of mean intersection over union (mIoU), floating point operations (FLOPs), and number of parameters. The figure clearly illustrates that our method outperforms all of the shown models while requiring substantially fewer FLOPs and parameters. This stems from a design that captures both the local context around small cracks and the long-range context, allowing the model to identify complete cracks while mitigating background interference.
In the local feature extraction stage, we split the channels and apply three convolution operations with different kernel sizes to obtain the local context of cracks. Subsequently, we use a combination of strip pooling and one-dimensional convolution to capture long-range context without compressing channel features. Finally, we construct a lightweight multiscale feature fusion module to aggregate shallow detail and deep semantic information. Within these modules, we employ depthwise separable convolutions, dropout, and residual connections to mitigate overfitting and vanishing or exploding gradients, resulting in a lightweight neural network well suited to crack detection.
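To illustrate the channel-splitting idea, the following is a minimal PyTorch sketch of a multiscale local branch; the split ratios, kernel sizes (3/5/7), and module name are assumptions made for exposition and do not reproduce the exact CrackScope implementation.

```python
import torch
import torch.nn as nn

class MultiScaleLocalBranch(nn.Module):
    """Illustrative sketch: split the channels, apply depthwise convolutions
    with different kernel sizes, then fuse with a pointwise convolution.
    Kernel sizes and split ratios are assumptions, not the paper's values."""
    def __init__(self, channels: int, kernel_sizes=(3, 5, 7)):
        super().__init__()
        self.splits = [channels // 3, channels // 3, channels - 2 * (channels // 3)]
        self.branches = nn.ModuleList([
            nn.Conv2d(c, c, k, padding=k // 2, groups=c)  # depthwise conv per split
            for c, k in zip(self.splits, kernel_sizes)
        ])
        self.fuse = nn.Conv2d(channels, channels, kernel_size=1)  # pointwise fusion
        self.act = nn.GELU()

    def forward(self, x):
        parts = torch.split(x, self.splits, dim=1)
        parts = [branch(p) for branch, p in zip(self.branches, parts)]
        out = self.fuse(torch.cat(parts, dim=1))
        return self.act(out) + x  # residual connection


if __name__ == "__main__":
    y = MultiScaleLocalBranch(64)(torch.randn(1, 64, 100, 100))
    print(y.shape)  # torch.Size([1, 64, 100, 100])
```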
In summary, our main contributions are as follows.
(1) We propose a novel crack image segmentation model called CrackScopeNet designed to meet the requirements of structural health monitoring. The model effectively extracts information at multiple levels during the downsampling stage and fuses key features during the upsampling stage.
(2) We introduce a lightweight multiscale branch module and a stripwise context attention module designed to match the morphological characteristics of cracks. These components effectively capture rich contextual information while minimizing computational costs. Compared to the previous HrSegNetB48 lightweight crack segmentation model, our approach reduces memory usage by approximately 5.2 times and improves inference speed by 1.7 times.
(3) CrackScopeNet demonstrates state-of-the-art performance on the CrackSeg9k dataset, and exhibits excellent transferability to small datasets in specific scenarios; additionally, the model has a low inference delay on resource-constrained drone platforms, making it ideal for outdoor crack detection through computer vision. This ensures that drone platforms can perform rapid crack detection and analysis, enhancing the efficiency and effectiveness of structural health monitoring.
5. Experiment
In this section, we first conduct a comprehensive quantitative comparison between CrackScopeNet and the most advanced segmentation models in various metrics, visualize the results, and comprehensively analyze the detection performance. Subsequently, we explore the transfer learning capability of our model on crack datasets specific to other scenarios. Finally, we perform ablation studies to meticulously examine the significance and impact of each component within CrackScopeNet.
5.1. Comparative Experiments
As our primary objective is to achieve an exceptional balance between the accuracy of crack region extraction and inference speed, we compare CrackScopeNet with three types of models: classical general semantic segmentation models, advanced lightweight semantic segmentation models, and the latest models designed explicitly for crack segmentation, totaling thirteen models. Specifically, U-Net [30], PSPNet [31], SegNet [53], DeeplabV3+ [23], SegFormer [19], and SegNext [24] were selected as six classical segmentation models with high accuracy. BiSeNet [28], BiSeNetV2 [54], STDC [55], TopFormer [34], and SeaFormer [35] were chosen as lightweight semantic segmentation models due to their advantages in inference speed; notably, SegFormer, TopFormer, and SeaFormer are all transformer-based methods that have demonstrated outstanding performance on large datasets such as Cityscapes [56]. Additionally, we included two specialized crack segmentation models, U2Crack [52] and HrSegNet [29], which have been optimized for the crack detection scenario based on general semantic segmentation models.
It is important to note that in order to ensure that all models could be easily converted to ONNX format and deployed on edge devices with limited computational resources and memory, we selected the lightweight MobileNetV2 [57] and ResNet-18 [58] backbones for the DeepLabV3+ and BiSeNet models, respectively, while for SegFormer and SegNext we chose the lightweight versions SegFormer-B0 [19] and SegNext_MSCAN_Tiny [24], which are suited for real-time semantic segmentation as proposed by their authors. For TopFormer and SeaFormer, we found during training that the tiny versions had difficulty converging; thus, we used only their base versions.
Quantitative Results. Table 4 presents the performance of each baseline network and the proposed CrackScopeNet on the CrackSeg9k dataset, with the best values highlighted in bold. Analyzing the accuracy of the different types of segmentation networks in the table, the larger models generally achieve higher mIoU scores than the lightweight models; compared to the classical high-accuracy models, our model achieves the best performance in terms of mIoU, recall, and F1 score, with scores of 82.15%, 89.24%, and 89.29%, respectively. Although our model's precision (89.34%) is 1.26% lower than that of U-Net, U-Net's recall is 2.24% lower, and our model has 12 times fewer parameters and 48 times fewer FLOPs.
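For reference, and assuming the standard pixel-wise definitions of these metrics (the exact formulation is not restated in this section), they are computed from true positives (TP), false positives (FP), and false negatives (FN) as
\[
\mathrm{Precision}=\frac{TP}{TP+FP},\quad
\mathrm{Recall}=\frac{TP}{TP+FN},\quad
F1=\frac{2\,\mathrm{Precision}\cdot\mathrm{Recall}}{\mathrm{Precision}+\mathrm{Recall}},\quad
\mathrm{mIoU}=\frac{1}{C}\sum_{c=1}^{C}\frac{TP_c}{TP_c+FP_c+FN_c},
\]
where C = 2 for the crack and background classes.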
In terms of network weight, the network proposed in this paper achieves the best balance between accuracy and model size on the CrackSeg9k dataset, as intuitively illustrated in Figure 1. Our model achieves the highest mIoU with only 1.05 M parameters and 1.58 G FLOPs, making it remarkably lightweight. Its FLOPs are slightly higher than those of TopFormer and SeaFormer, but lower than those of all other small models; notably, due to the small size of the crack dataset, the learning capability of lightweight segmentation networks is evidently limited, as mainstream lightweight segmentation models do not consider the unique characteristics of cracks, resulting in poor performance. The proposed CrackScopeNet architecture successfully achieves the design goal of a lightweight network structure while maintaining superior segmentation performance, making it easily deployable on resource-constrained edge devices.
Moreover, compared to the state-of-the-art crack image segmentation algorithms, the proposed method achieves an mIoU of 82.15% with fewer parameters and FLOPs, surpassing the highest-accuracy versions of the U2Crack and HrSegNet models. Notably, the HrSegNet model employs an online hard example mining (OHEM) technique during training to improve its accuracy. In contrast, we only use the cross-entropy loss function for model parameter updating without deliberately employing any training tricks to enhance performance, showcasing the significant benefits of considering crack morphology during model design.
Qualitative Results. Figure 4, Figure 5, and Figure 6 display the qualitative results of all compared models. Our method achieves superior visual performance compared to the other models. From the first, second, and third rows of Figure 4, it can be observed that CrackScopeNet and the segmentation algorithms with larger parameter counts achieve satisfactory results for high-resolution images with apparent crack features. In the fourth row, where the original image contains asphalt with color and texture similar to cracks, CrackScopeNet and SegFormer successfully overcome the background noise interference. This is attributed to their long-range contextual dependencies, which allow them to effectively capture the relationships between crack pixels. In the fifth row, the results show that CrackScopeNet exhibits robust performance even under uneven illumination conditions. This can be attributed to the design of the network structure, which considers both the local and global features of cracks while effectively suppressing noise.
Figure 5 clearly shows that the lightweight networks struggle to eliminate background noise interference, leading to fragmented segmentation results for fine cracks. This outcome is due to the limited parameters learned by lightweight models. Finally, Figure 6 presents the visualization results of the most advanced crack segmentation models. U2Crack [52], based on the ViT [17] architecture, achieves a broader receptive field that somewhat alleviates background noise, though at the cost of significant computational overhead. HrSegNet [29] maintains a high-resolution branch to capture rich and detailed features. As seen in the last two columns of Figure 6, the increased number of channels in the HrSegNet network allows more detailed information to be extracted; however, this leads to background information being misclassified as cracks, which explains the high precision and low recall of HrSegNet. In summary, CrackScopeNet outperforms the other segmentation models, demonstrating excellent crack detection performance under various noise conditions with fewer parameters and FLOPs.
Inference on Navio2-based drones. In practical applications, there remains a substantial gap between real-time semantic segmentation algorithms as designed and validated on workstations and their deployment on mobile and edge devices, which face challenges such as limited memory resources and low computational efficiency. To better simulate edge devices used for outdoor structural health monitoring, we evaluated the inference speed of the models without GPU acceleration. Notably, to ensure that all models could be deployed on the drone platform without sacrificing accuracy through pruning or compression, we excluded storage-intensive and computationally demanding models such as UNet, SegNet, and PSPNet. We converted the remaining models to ONNX format and tested their inference speeds on a Navio2-based drone equipped with a representative Raspberry Pi 4B, focusing on comparing our proposed model with models with small FLOPs and parameter counts: BiSeNetV2, DeepLabV3+, STDC, HrSegNetB48, SegFormer, TopFormer, and SeaFormer. The test settings were as follows: input image size of 3 × 400 × 400, batch size of 1, and 2000 test iterations. To ensure a fair comparison, we did not optimize or prune any models during deployment, meaning that the actual inference delay in practical applications could be further reduced from these test results.
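A minimal sketch of this export-and-benchmark procedure is given below, assuming a PyTorch model and the onnxruntime CPU execution provider; the stand-in module and file name are placeholders, and the loop count mirrors the 2000-iteration setting.

```python
import time
import numpy as np
import torch
import onnxruntime as ort

# Stand-in module so the sketch runs end to end; in practice this would be
# the trained segmentation model in eval() mode.
model = torch.nn.Conv2d(3, 2, kernel_size=3, padding=1).eval()
dummy = torch.randn(1, 3, 400, 400)
torch.onnx.export(model, dummy, "model.onnx", opset_version=11,
                  input_names=["input"], output_names=["output"])

# Benchmark with the CPU execution provider (no GPU acceleration, as on the
# Raspberry Pi 4B).
sess = ort.InferenceSession("model.onnx", providers=["CPUExecutionProvider"])
x = np.random.randn(1, 3, 400, 400).astype(np.float32)
for _ in range(10):                                  # warm-up runs
    sess.run(None, {"input": x})
n_runs = 2000
start = time.perf_counter()
for _ in range(n_runs):
    sess.run(None, {"input": x})
elapsed = time.perf_counter() - start
print(f"mean latency: {1000 * elapsed / n_runs:.2f} ms per image")
```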
As shown in Figure 7, the test results indicate that when running on a highly resource-constrained drone platform, the proposed CrackScopeNet architecture achieves faster inference speed compared to other real-time or lightweight semantic segmentation networks based on convolutional neural networks, such as BiSeNet, BiSeNetV2, and STDC. Additionally, TopFormer and SeaFormer, which are designed with deployment on resource-limited edge devices in mind, both achieve extremely low inference latency; however, these models perform poorly on the crack datasets due to inadequate data volume. Our proposed model achieves remarkable crack segmentation accuracy while maintaining rapid inference speed, establishing its advantages over the competing models.
These results confirm the efficacy of deploying the CrackScopeNet model on outdoor mobile devices, where high-speed inference and lightweight architecture are crucial for real-time processing and analysis of infrastructure surface cracks. By outperforming other state-of-the-art models, CrackScopeNet proves to be a suitable solution for addressing the challenges associated with outdoor edge computing.
5.2. Scaling Study
To explore the adaptability of our model, we adjusted the number of channels and stacked different numbers of CrackScope modules to cater to a broader range of application scenarios. Because CrackSeg9k is composed of multiple crack datasets, we also investigated the model’s transferability to specific application scenarios.
We adjusted the base number of channels after the stem from 32 to 64. Correspondingly, the number of channels in the remaining three feature extraction stages was increased from (32, 64, 128) to (64, 128, 160) in order to capture more features, while the number of CrackScope modules stacked in each stage was adjusted from (3, 3, 4) to (3, 3, 3); we refer to the adjusted model as CrackScopeNet_Large. First, we trained CrackScopeNet_Large on CrackSeg9k using the same parameter settings as the base version and evaluated it on the test set. Furthermore, we used the training parameters and weights obtained from CrackSeg9k for these two models as the basis for transferring them to downstream tasks in two specific scenarios. The Ozgenel dataset consists of high-resolution concrete crack images similar to some scenarios in CrackSeg9k; its images were cropped to 448 × 448. The Aerial Track Dataset consists of low-altitude drone-captured images of post-earthquake highway cracks, a type of scene not present in CrackSeg9k; these images were cropped to 512 × 512.
Table 5 presents the mIoU scores, parameter counts, and FLOPs of the base CrackScopeNet model and the high-accuracy version CrackScopeNet_Large on the CrackSeg9k dataset and the two specific scenario datasets. In this table, mIoU(F) represents the mIoU score obtained after pretraining the model on CrackSeg9k and fine-tuning it on the respective dataset. It is evident that the large version of the model achieves higher segmentation accuracy across all datasets, though with approximately double the parameters and three times the FLOPs. Therefore, if computational resources and memory are sufficient and higher accuracy in crack segmentation is required, the large version or further stacking of CrackScope modules can be employed.
For specific scenario training, whether from scratch or fine-tuning, all our models were trained for only 20 epochs. It can be seen that the models converge quickly even when training from scratch. We attribute this phenomenon to the initial design of CrackScopeNet, which considers the morphology of cracks and is able to successfully capture the necessary contextual information. For training using transfer learning, both versions of the model achieve remarkable results on the Ozgenel dataset, with mIoU scores of 90.1% and 92.31%, respectively. Even for the Aerial Track dataset, which includes low-altitude remote sensing images of highway cracks not seen in CrackSeg9k, both of our models still perform exceptionally well, achieving respective mIoU scores of 83.26% and 84.11%. These results demonstrate the proposed model’s rapid adaptability to small datasets, aligning well with real-world tasks.
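The transfer-learning setup can be summarized by the following PyTorch sketch, where CrackScopeNet, the checkpoint path, and OzgenelCrackDataset are placeholders for the actual implementation artifacts (not defined here), and the optimizer choice and learning rate are illustrative assumptions rather than the exact training configuration.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader

# Placeholder names: the model class, checkpoint file, and dataset class are
# assumed to exist in the project; they are not defined in this sketch.
model = CrackScopeNet(num_classes=2)
model.load_state_dict(torch.load("crackscopenet_crackseg9k.pth",
                                 map_location="cpu"))  # CrackSeg9k weights

train_loader = DataLoader(OzgenelCrackDataset(split="train"),
                          batch_size=8, shuffle=True)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)  # assumed settings
criterion = nn.CrossEntropyLoss()

model.train()
for epoch in range(20):                  # fine-tune for only 20 epochs
    for images, masks in train_loader:   # masks: (N, H, W) class indices
        optimizer.zero_grad()
        logits = model(images)           # (N, 2, H, W)
        loss = criterion(logits, masks)
        loss.backward()
        optimizer.step()
```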
5.3. Diagnostic Experiments
To gain more insight into CrackScopeNet, we conducted a set of ablation studies on our proposed model. For efficiency, all methods mentioned in this section were trained with the same parameters for 200 epochs.
Stripwise Context Attention. First, we examined the role of the critical SWA module in CrackScopeNet by replacing it with two advanced attention mechanisms, CBAM [37] and CA [38]. The results shown in Table 6 demonstrate that without any attention mechanism, merely stacking convolutional neural networks for feature extraction yields poor performance due to the limited receptive field. When the SWA attention mechanism based on strip pooling and one-dimensional convolution was adopted, the network was able to capture long-range contextual information; under this configuration, the model exhibited the best performance.
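To make the mechanism concrete, below is a minimal PyTorch sketch of a strip-pooling attention block in the spirit of SWA: horizontal and vertical strip pooling followed by one-dimensional convolutions, applied as an attention map without compressing the channel dimension. The kernel size and the way the two directions are combined are assumptions and do not reproduce the exact SWA module.

```python
import torch
import torch.nn as nn

class StripAttention(nn.Module):
    """Illustrative strip-pooling attention: pool along one spatial axis,
    refine with a 1D convolution along the other, and gate the input.
    This is a sketch, not the exact SWA implementation."""
    def __init__(self, channels: int, k: int = 7):
        super().__init__()
        # Depthwise 1D convolutions keep the channel count unchanged.
        self.conv_h = nn.Conv1d(channels, channels, k, padding=k // 2, groups=channels)
        self.conv_w = nn.Conv1d(channels, channels, k, padding=k // 2, groups=channels)
        self.sigmoid = nn.Sigmoid()

    def forward(self, x):
        # Horizontal strip pooling: average over width -> (N, C, H)
        feat_h = self.conv_h(x.mean(dim=3))
        # Vertical strip pooling: average over height -> (N, C, W)
        feat_w = self.conv_w(x.mean(dim=2))
        # Combine both directional contexts into an (N, C, H, W) attention map.
        attn = self.sigmoid(feat_h.unsqueeze(3) + feat_w.unsqueeze(2))
        return x * attn


if __name__ == "__main__":
    out = StripAttention(64)(torch.randn(2, 64, 50, 50))
    print(out.shape)  # torch.Size([2, 64, 50, 50])
```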
Figure 8 shows the class activation maps (CAM) [59] before the segmentation head of CrackScopeNet. It can be observed that the model without SWA is easily disturbed by shadows, whereas with the SWA module the model can focus on the global crack areas. Next, we sequentially replaced the SWA module with the channel-spatial feature-based CBAM attention mechanism and the coordinate attention (CA) mechanism, which also uses strip pooling. While the model parameter counts did not change significantly, the performance declined by 0.2% and 0.17%, respectively.
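One simple way to produce such activation maps is to hook the feature map feeding the segmentation head, average it over channels, and upsample it to the input resolution; the sketch below illustrates this with a forward hook. The hooked layer is a placeholder, and this channel-averaging view is only an approximation of the CAM technique of [59].

```python
import torch
import torch.nn.functional as F

def activation_map(model, layer, image):
    """Average the channels of an intermediate feature map and upsample it,
    giving a coarse view of where the network responds before the seg head.
    `layer` is whatever module feeds the segmentation head (placeholder)."""
    feats = {}
    handle = layer.register_forward_hook(
        lambda m, inp, out: feats.update(value=out.detach()))
    with torch.no_grad():
        model(image)                       # image: (1, 3, H, W)
    handle.remove()
    cam = feats["value"].mean(dim=1, keepdim=True)             # (1, 1, h, w)
    cam = F.interpolate(cam, size=image.shape[-2:],
                        mode="bilinear", align_corners=False)
    cam = (cam - cam.min()) / (cam.max() - cam.min() + 1e-6)   # normalize to [0, 1]
    return cam[0, 0]                       # (H, W) heatmap
```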
Furthermore, we explored the benefits of the different attention mechanisms for other models by optimizing the advanced HrSegNetB48 lightweight crack segmentation network [29]. HrSegNetB48 consists of high-resolution and auxiliary branches, merging shallow detail information with deep semantic information at each stage. Therefore, we added the SWA, CBAM, and CA attention mechanisms after feature fusion to capture richer features.
Table 7 shows the performance of HrSegNetB48 with the different attention mechanisms, clearly indicating that introducing the SWA attention mechanism to capture long-range contextual information provides the most significant benefit.
Multiscale Branch. Next, we examined the effect of the multiscale branch in the CrackScope module. To ensure fairness, we replaced the multiscale branch with a convolution of larger kernel size (5 × 5 instead of 3 × 3). The results with and without the multiscale branch are shown in Table 6. It is evident that using a 5 × 5 convolution instead of the multiscale branch decreases the mIoU score (−0.16%) despite requiring more floating-point computations. This demonstrates that blindly adopting large kernel convolutions increases computational overhead without significant performance improvements. The benefits conferred by the multiscale branch were further analyzed through the CAM. As shown in the third column of Figure 8, when the multiscale branch is not used the network misses the feature information of small cracks, while the model with this branch captures the features of cracks with various shapes and sizes.
Decoder. CrackScopeNet uses a simple decoder to fuse feature information of different scales, compressing the channel features and merging features from different stages. Currently, the most popular decoders use an atrous spatial pyramid pooling (ASPP) [23] module to introduce multiscale information. To explore whether introducing an ASPP module could benefit our model, and to investigate the effectiveness of our proposed lightweight decoder, we replaced our decoder with the ASPP method adopted by DeepLabV3+ [23]. The results are shown in the last two rows of Table 6. The computational overhead is large because of the need to perform parallel dilated convolution operations on the deep semantic features; however, the performance of the model does not improve. This shows that using multiple sets of dilated convolutions to capture multiscale features incurs additional computational overhead while not contributing to the performance of our model.
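For context, the ASPP design referred to above applies several dilated convolutions in parallel on the deepest feature map; the minimal sketch below (with dilation rates 1, 6, 12, and 18, the common DeepLabV3+ choice assumed here) makes the extra cost apparent, since every branch runs over the full deep feature map.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ASPP(nn.Module):
    """Minimal ASPP sketch: parallel dilated 3x3 convolutions plus image-level
    pooling, concatenated and projected. Dilation rates are the common
    DeepLabV3+ defaults, assumed here for illustration."""
    def __init__(self, in_ch: int, out_ch: int, rates=(1, 6, 12, 18)):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Conv2d(in_ch, out_ch, 3, padding=r, dilation=r) for r in rates
        ])
        self.image_pool = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Conv2d(in_ch, out_ch, 1))
        self.project = nn.Conv2d(out_ch * (len(rates) + 1), out_ch, 1)

    def forward(self, x):
        h, w = x.shape[-2:]
        feats = [branch(x) for branch in self.branches]
        pooled = F.interpolate(self.image_pool(x), size=(h, w),
                               mode="bilinear", align_corners=False)
        return self.project(torch.cat(feats + [pooled], dim=1))
```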
5.4. Experiment Conclusions
Based on the comparative experiments conducted in previous sections, CrackScopeNet demonstrates significant advantages over both classical and lightweight semantic segmentation models in terms of performance, parameter count, and FLOPs. On the composite CrackSeg9k dataset, CrackScopeNet achieves high segmentation accuracy and shows excellent transferability to specific scenarios. Notably, it maintains a low parameter count and minimal FLOPs, which translates to low-latency inference on resource-constrained drone platforms without the need for GPU acceleration. This efficiency is achieved by taking crack morphology into account, allowing CrackScopeNet to remain lightweight and computationally efficient and making it particularly suitable for deployment on mobile devices in outdoor environments. In summary, CrackScopeNet achieves a better balance between segmentation accuracy and inference speed than the other networks examined in this study, making it a promising solution for timely crack detection and analysis on infrastructure surfaces using drones.
However, there remain some drawbacks in this study. The inference speed and performance were not tested on other hardware platforms, such as the Snapdragon 865, which may offer different computational capabilities. Additionally, our study did not explore the potential acceleration that NPUs (Neural Processing Units) or GPUs could provide. Further investigation into how these processing units can be fully utilized could offer significant improvements in the efficiency and performance of the model.
6. Discussion
In this paper, we present CrackScopeNet, a lightweight infrastructure surface crack segmentation network specifically designed to address the challenges posed by varying crack sizes, irregular contours, and subtle differences between cracks and normal regions in real-world applications. The proposed network structure captures the local context information and long-distance dependencies of cracks through a lightweight multiscale branch and an SWA attention mechanism, respectively, and effectively extracts the low-level details and high-level semantic information required for accurate crack segmentation.
In this work, we find that using channel-wise partitioning to apply different kernel sizes effectively captures multiscale features without introducing significant computational overhead. Additionally, by incorporating an attention mechanism that accounts for long-range dependencies, it is possible to compensate for the limitations of downsampling without resorting to additional detail branches, which would otherwise increase computational demands. Our experimental results demonstrate that CrackScopeNet delivers robust performance and high accuracy. It outperforms larger models like SegFormer in terms of efficiency, significantly reducing the number of parameters and computational cost. Furthermore, our method achieves faster inference speeds than other lightweight models such as BiSeNet and STDC even in the absence of GPU acceleration. This performance makes it highly suitable for deployment on resource-constrained drone platforms, enabling efficient and low-latency crack detection in structural health monitoring. By making the model and code publicly available, we aim to advance the application of UAV remote sensing technology in infrastructure maintenance, providing an efficient and practical tool for the timely detection and analysis of cracks.
Furthermore, utilizing UAVs to monitor crack development in geological disaster scenarios can greatly aid early warning efforts. CrackScopeNet, having proven effective in infrastructure crack detection, has the potential to be adapted to these contexts through domain adaptation. We have undertaken preliminary investigations by capturing images of hazardous rock formations with UAVs and using our model to extract crack regions, as illustrated in Figure 9. These environments present more intricate crack patterns, including various damage types and complex curved geometries. Our approach currently exhibits limitations in detecting fine cracks, particularly those that blend into the background. Our future work will focus on enhancing the model's sensitivity and capacity in order to accurately identify smaller and more complex crack patterns in challenging conditions, especially in geological disaster monitoring.
Lastly, in this era of large models, our model has only been trained and evaluated on datasets containing a few thousand images; the need for large-scale data collection and manual labeling remains a bottleneck. Recent advances in generative AI and self-supervised learning could bypass the limitations imposed by data acquisition and manual annotation. Researchers can use the inherent structure or attributes of existing data to generate richer "synthetic images" and "synthetic labels", a promising research avenue that could be applied to crack detection.