1. Introduction
Cracks serve as early indicators of structural damage in buildings, bridges, and roads, making their detection vital for structural health monitoring. Analyzing the morphological characteristics, positional information, and extent of internal damage in cracks allows for accurate safety assessments of buildings and infrastructure [1,2]. Timely detection and repair of cracks not only reduces maintenance costs but also prevents further structural deterioration, ensuring safety and durability [3,4].
Traditional crack detection methods such as visual inspections and manual evaluations are often costly and inefficient, relying heavily on the expertise of inspectors, which can lead to subjective and inconsistent results [5]. Therefore, the development of objective and efficient automated crack detection methods has become a significant trend in this field. Various sensor-based methods for automatic or semi-automatic crack detection have been proposed, including crack meters, RGB-D sensors, and laser scanners [6,7,8]. Although these sensors are accurate, they are expensive and challenging to deploy on large scales.
Advancements in computer vision technology have popularized image-based crack detection methods due to their long-distance, non-contact, and cost-effective nature. Traditional visual detection methods such as morphological image processing [9,10], filtering [11,12], and percolation models [13] are simple to implement and computationally light, but suffer from limited generalization performance. Environmental noise such as debris around cracks further complicates detection in practical engineering environments.
Recently, deep learning-based semantic segmentation algorithms have significantly improved the accuracy and stability of image recognition in noisy environments. These algorithms excel at locating and labeling crack pixels, providing comprehensive information on crack distribution, width, length, and shape [14]. Nevertheless, general scene-understanding models often fail to capture the unique features of cracks, which are typically thin, long, and irregularly shaped [15,16]. Cracks typically span entire images while constituting only a small fraction of the pixel area, necessitating models that can capture long-range dependencies between pixels. While self-attention mechanisms excel at aggregating long-distance contextual information [17,18,19], they come with high computational costs that limit detection speed. Additionally, cracks exhibit uneven distribution along with significant size variations, necessitating multiscale feature extraction [20,21,22]. Although methods such as DeepLabV3+ [23] and SegNext [24] are able to capture multiscale information, they are computationally intensive and costly for large images.
The use of unmanned aerial vehicles (UAVs) for crack monitoring has become prevalent due to their flexibility, cost effectiveness, and ability to efficiently cover both large and difficult-to-access areas [25]. However, the computational resources available on edge devices are typically limited and often lack high-power GPUs, making it crucial to deploy lightweight models that can perform processing and analysis with low latency [26]. Researchers have proposed lightweight networks that reduce computational costs by minimizing deep downsampling, reducing channel numbers, and optimizing convolutional design; however, reducing the downsampling stages can leave models with receptive fields too small to cover large objects, as seen with ENet [27]. Bilateral backbone models partially address this issue; for instance, BiSeNet [28] adds a context path with fewer channels, while HrSegNet [29] maintains high-resolution features while adjusting parameters to reduce channels. Unfortunately, these two-branch structures increase computational overhead, and reducing the channels can hinder the model's ability to learn relational features.
Furthermore, several challenges affect the design of lightweight models for surface crack segmentation: (1) existing methods increase computational complexity by incorporating large kernel convolutions, multiple parallel branches, and feature pyramid structures to handle various object sizes and shapes; (2) diverse crack image scenes and complex backgrounds limit feature extraction by lightweight models, making it difficult to learn effective information from limited datasets; and (3) the subtle differences between cracked and normal areas introduce complications during segmentation. While adding multiple skip connections and auxiliary training branches can improve accuracy, this leads to increased memory overhead.
To address the aforementioned challenges, we propose CrackScopeNet, a lightweight segmentation network optimized for structural surface cracks. It features an optimized multiscale branching architecture and a carefully designed stripwise context attention (SWA) module.
Figure 1 presents a comparison on the CrackSeg9k dataset between traditional and lightweight crack-specific semantic segmentation networks and our approach in terms of mean intersection over union (mIoU), floating point operations (FLOPs), and number of parameters. The figure clearly illustrates that our method outperforms all of the shown models while requiring substantially fewer FLOPs and parameters. This stems from a design that captures both the local context around small cracks and the long-range context, allowing the model to identify complete cracks while mitigating background interference.
In the local feature extraction stage, we split the channels and apply three convolution operations with different kernel sizes to obtain the local context of cracks. Subsequently, we use a combination of strip pooling and one-dimensional convolution to capture long-range context without compressing channel features. Finally, we construct a lightweight multiscale feature fusion module to aggregate shallow detail and deep semantic information. Within these modules, we employ depthwise separable convolutions, dropout, and residual connections to mitigate overfitting and vanishing or exploding gradients, resulting in a lightweight neural network well suited to crack detection.
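To illustrate the channel-splitting idea, the following is a minimal PyTorch sketch of a multiscale local branch; the split ratios, kernel sizes (3/5/7), and module name are assumptions made for exposition and do not reproduce the exact CrackScope implementation.

```python
import torch
import torch.nn as nn

class MultiScaleLocalBranch(nn.Module):
    """Illustrative sketch: split the channels, apply depthwise convolutions
    with different kernel sizes, then fuse with a pointwise convolution.
    Kernel sizes and split ratios are assumptions, not the paper's values."""
    def __init__(self, channels: int, kernel_sizes=(3, 5, 7)):
        super().__init__()
        self.splits = [channels // 3, channels // 3, channels - 2 * (channels // 3)]
        self.branches = nn.ModuleList([
            nn.Conv2d(c, c, k, padding=k // 2, groups=c)  # depthwise conv per split
            for c, k in zip(self.splits, kernel_sizes)
        ])
        self.fuse = nn.Conv2d(channels, channels, kernel_size=1)  # pointwise fusion
        self.act = nn.GELU()

    def forward(self, x):
        parts = torch.split(x, self.splits, dim=1)
        parts = [branch(p) for branch, p in zip(self.branches, parts)]
        out = self.fuse(torch.cat(parts, dim=1))
        return self.act(out) + x  # residual connection


if __name__ == "__main__":
    y = MultiScaleLocalBranch(64)(torch.randn(1, 64, 100, 100))
    print(y.shape)  # torch.Size([1, 64, 100, 100])
```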
In summary, our main contributions are as follows.
(1) We propose a novel crack image segmentation model called CrackScopeNet designed to meet the requirements of structural health monitoring. The model effectively extracts information at multiple levels during the downsampling stage and fuses key features during the upsampling stage.
(2) We introduce a lightweight multiscale branch module and a stripwise context attention module designed to match the morphological characteristics of cracks. These components effectively capture rich contextual information while minimizing computational costs. Compared to the previous HrSegNetB48 lightweight crack segmentation model, our approach reduces memory usage by approximately 5.2 times and improves inference speed by 1.7 times.
(3) CrackScopeNet demonstrates state-of-the-art performance on the CrackSeg9k dataset, and exhibits excellent transferability to small datasets in specific scenarios; additionally, the model has a low inference delay on resource-constrained drone platforms, making it ideal for outdoor crack detection through computer vision. This ensures that drone platforms can perform rapid crack detection and analysis, enhancing the efficiency and effectiveness of structural health monitoring.
5. Experiment
In this section, we first conduct a comprehensive quantitative comparison between CrackScopeNet and the most advanced segmentation models in various metrics, visualize the results, and comprehensively analyze the detection performance. Subsequently, we explore the transfer learning capability of our model on crack datasets specific to other scenarios. Finally, we perform ablation studies to meticulously examine the significance and impact of each component within CrackScopeNet.
5.1. Comparative Experiments
As our primary objective is to achieve an exceptional balance between the accuracy of crack region extraction and inference speed, we compare CrackScopeNet with three types of models: classical general semantic segmentation models, advanced lightweight semantic segmentation models, and the latest models designed explicitly for crack segmentation, totaling thirteen models. Specifically, U-Net [30], PSPNet [31], SegNet [53], DeeplabV3+ [23], SegFormer [19], and SegNext [24] were selected as six classical segmentation models with high accuracy. BiSeNet [28], BiSeNetV2 [54], STDC [55], TopFormer [34], and SeaFormer [35] were chosen as lightweight semantic segmentation models due to their advantages in inference speed; notably, SegFormer, TopFormer, and SeaFormer are all transformer-based methods that have demonstrated outstanding performance on large datasets such as Cityscapes [56]. Additionally, we included two specialized crack segmentation models, U2Crack [52] and HrSegNet [29], which have been optimized for the crack detection scenario based on general semantic segmentation models.
It is important to note that in order to ensure that all models could be easily converted to ONNX format and deployed on edge devices with limited computational resources and memory, we selected the lightweight MobileNetV2 [57] and ResNet-18 [58] backbones for the DeepLabV3+ and BiSeNet models, respectively, while for SegFormer and SegNext we chose the lightweight versions SegFormer-B0 [19] and SegNext_MSCAN_Tiny [24], which are suited for real-time semantic segmentation as proposed by their authors. For TopFormer and SeaFormer, we found during training that the tiny versions had difficulty converging; thus, we used only their base versions.
Quantitative Results. Table 4 presents the performance of each baseline network and the proposed CrackScopeNet on the CrackSeg9k dataset, with the best values highlighted in bold. Analyzing the accuracy of the different types of segmentation networks in the table, the larger models generally achieve higher mIoU scores than the lightweight models; compared to the classical high-accuracy models, our model achieves the best performance in terms of mIoU, recall, and F1 score, with scores of 82.15%, 89.24%, and 89.29%, respectively. Although our model's precision (89.34%) is 1.26% lower than that of U-Net, U-Net's recall is 2.24% lower, and our model has 12 times fewer parameters and 48 times fewer FLOPs.
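For reference, and assuming the standard pixel-wise definitions of these metrics (the exact formulation is not restated in this section), they are computed from true positives (TP), false positives (FP), and false negatives (FN) as
\[
\mathrm{Precision}=\frac{TP}{TP+FP},\quad
\mathrm{Recall}=\frac{TP}{TP+FN},\quad
F1=\frac{2\,\mathrm{Precision}\cdot\mathrm{Recall}}{\mathrm{Precision}+\mathrm{Recall}},\quad
\mathrm{mIoU}=\frac{1}{C}\sum_{c=1}^{C}\frac{TP_c}{TP_c+FP_c+FN_c},
\]
where C = 2 for the crack and background classes.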
In terms of network weight, the network proposed in this paper achieves the best balance between accuracy and model size on the CrackSeg9k dataset, as intuitively illustrated in Figure 1. Our model achieves the highest mIoU with only 1.05 M parameters and 1.58 G FLOPs, making it remarkably lightweight. Its FLOPs are slightly higher than those of TopFormer and SeaFormer, but lower than those of all other small models; notably, due to the small size of the crack dataset, the learning capability of lightweight segmentation networks is evidently limited, as mainstream lightweight segmentation models do not consider the unique characteristics of cracks, resulting in poor performance. The proposed CrackScopeNet architecture successfully achieves the design goal of a lightweight network structure while maintaining superior segmentation performance, making it easily deployable on resource-constrained edge devices.
Moreover, compared to the state-of-the-art crack image segmentation algorithms, the proposed method achieves an mIoU of 82.15% with fewer parameters and FLOPs, surpassing the highest-accuracy versions of the U2Crack and HrSegNet models. Notably, the HrSegNet model employs an online hard example mining (OHEM) technique during training to improve its accuracy. In contrast, we only use the cross-entropy loss function for model parameter updating without deliberately employing any training tricks to enhance performance, showcasing the significant benefits of considering crack morphology during model design.
Qualitative Results. Figure 4, Figure 5, and Figure 6 display the qualitative results of all compared models. Our method achieves superior visual performance compared to the other models. From the first, second, and third rows of Figure 4, it can be observed that CrackScopeNet and the segmentation algorithms with larger parameter counts achieve satisfactory results for high-resolution images with apparent crack features. In the fourth row, where the original image contains asphalt with color and texture similar to cracks, CrackScopeNet and SegFormer successfully overcome the background noise interference. This is attributed to their long-range contextual dependencies, which allow them to effectively capture the relationships between crack pixels. In the fifth row, the results show that CrackScopeNet exhibits robust performance even under uneven illumination conditions. This can be attributed to the design of the network structure, which considers both the local and global features of cracks while effectively suppressing noise.
Figure 5 clearly shows that the lightweight networks struggle to eliminate background noise interference, leading to fragmented segmentation results for fine cracks. This outcome is due to the limited parameters learned by lightweight models. Finally, Figure 6 presents the visualization results of the most advanced crack segmentation models. U2Crack [52], based on the ViT [17] architecture, achieves a broader receptive field that somewhat alleviates background noise, though at the cost of significant computational overhead. HrSegNet [29] maintains a high-resolution branch to capture rich and detailed features. As seen in the last two columns of Figure 6, the increased number of channels in the HrSegNet network allows more detailed information to be extracted; however, this leads to background information being misclassified as cracks, which explains the high precision and low recall of HrSegNet. In summary, CrackScopeNet outperforms the other segmentation models, demonstrating excellent crack detection performance under various noise conditions with fewer parameters and FLOPs.
Inference on Navio2-based drones. In practical applications, there remains a substantial gap between real-time semantic segmentation algorithms as designed and validated on workstations and their deployment on mobile and edge devices, which face challenges such as limited memory resources and low computational efficiency. To better simulate edge devices used for outdoor structural health monitoring, we evaluated the inference speed of the models without GPU acceleration. Notably, to ensure that all models could be deployed on the drone platform without sacrificing accuracy through pruning or compression, we excluded storage-intensive and computationally demanding models such as UNet, SegNet, and PSPNet. We converted the remaining models to ONNX format and tested their inference speeds on a Navio2-based drone equipped with a representative Raspberry Pi 4B, focusing on comparing our proposed model with models with small FLOPs and parameter counts: BiSeNetV2, DeepLabV3+, STDC, HrSegNetB48, SegFormer, TopFormer, and SeaFormer. The test settings were as follows: input image size of 3 × 400 × 400, batch size of 1, and 2000 test iterations. To ensure a fair comparison, we did not optimize or prune any models during deployment, meaning that the actual inference delay in practical applications could be further reduced from these test results.
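A minimal sketch of this export-and-benchmark procedure is given below, assuming a PyTorch model and the onnxruntime CPU execution provider; the stand-in module and file name are placeholders, and the loop count mirrors the 2000-iteration setting.

```python
import time
import numpy as np
import torch
import onnxruntime as ort

# Stand-in module so the sketch runs end to end; in practice this would be
# the trained segmentation model in eval() mode.
model = torch.nn.Conv2d(3, 2, kernel_size=3, padding=1).eval()
dummy = torch.randn(1, 3, 400, 400)
torch.onnx.export(model, dummy, "model.onnx", opset_version=11,
                  input_names=["input"], output_names=["output"])

# Benchmark with the CPU execution provider (no GPU acceleration, as on the
# Raspberry Pi 4B).
sess = ort.InferenceSession("model.onnx", providers=["CPUExecutionProvider"])
x = np.random.randn(1, 3, 400, 400).astype(np.float32)
for _ in range(10):                                  # warm-up runs
    sess.run(None, {"input": x})
n_runs = 2000
start = time.perf_counter()
for _ in range(n_runs):
    sess.run(None, {"input": x})
elapsed = time.perf_counter() - start
print(f"mean latency: {1000 * elapsed / n_runs:.2f} ms per image")
```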
As shown in Figure 7, the test results indicate that when running on a highly resource-constrained drone platform, the proposed CrackScopeNet architecture achieves faster inference speed compared to other real-time or lightweight semantic segmentation networks based on convolutional neural networks, such as BiSeNet, BiSeNetV2, and STDC. Additionally, TopFormer and SeaFormer, which are designed with deployment on resource-limited edge devices in mind, both achieve extremely low inference latency; however, these models perform poorly on the crack datasets due to inadequate data volume. Our proposed model achieves remarkable crack segmentation accuracy while maintaining rapid inference speed, establishing its advantages over the competing models.
These results confirm the efficacy of deploying the CrackScopeNet model on outdoor mobile devices, where high-speed inference and lightweight architecture are crucial for real-time processing and analysis of infrastructure surface cracks. By outperforming other state-of-the-art models, CrackScopeNet proves to be a suitable solution for addressing the challenges associated with outdoor edge computing.
5.2. Scaling Study
To explore the adaptability of our model, we adjusted the number of channels and stacked different numbers of CrackScope modules to cater to a broader range of application scenarios. Because CrackSeg9k is composed of multiple crack datasets, we also investigated the model’s transferability to specific application scenarios.
We adjusted the base number of channels after the stem from 32 to 64. Correspondingly, the number of channels in the remaining three feature extraction stages was increased from (32, 64, 128) to (64, 128, 160) in order to capture more features, while the number of CrackScope modules stacked in each stage was adjusted from (3, 3, 4) to (3, 3, 3); we refer to the adjusted model as CrackScopeNet_Large. First, we trained CrackScopeNet_Large on CrackSeg9k using the same parameter settings as the base version and evaluated it on the test set. Furthermore, we used the training parameters and weights obtained from CrackSeg9k for these two models as the basis for transferring them to downstream tasks in two specific scenarios. The Ozgenel dataset consists of high-resolution concrete crack images similar to some scenarios in CrackSeg9k; its images were cropped to 448 × 448. The Aerial Track Dataset consists of low-altitude drone-captured images of post-earthquake highway cracks, a type of scene not present in CrackSeg9k; these images were cropped to 512 × 512.
Table 5 presents the mIoU scores, parameter counts, and FLOPs of the base CrackScopeNet model and the high-accuracy version CrackScopeNet_Large on the CrackSeg9k dataset and the two specific scenario datasets. In this table, mIoU(F) represents the mIoU score obtained after pretraining the model on CrackSeg9k and fine-tuning it on the respective dataset. It is evident that the large version of the model achieves higher segmentation accuracy across all datasets, though with approximately double the parameters and three times the FLOPs. Therefore, if computational resources and memory are sufficient and higher accuracy in crack segmentation is required, the large version or further stacking of CrackScope modules can be employed.
For specific scenario training, whether from scratch or fine-tuning, all our models were trained for only 20 epochs. It can be seen that the models converge quickly even when training from scratch. We attribute this phenomenon to the initial design of CrackScopeNet, which considers the morphology of cracks and is able to successfully capture the necessary contextual information. For training using transfer learning, both versions of the model achieve remarkable results on the Ozgenel dataset, with mIoU scores of 90.1% and 92.31%, respectively. Even for the Aerial Track dataset, which includes low-altitude remote sensing images of highway cracks not seen in CrackSeg9k, both of our models still perform exceptionally well, achieving respective mIoU scores of 83.26% and 84.11%. These results demonstrate the proposed model’s rapid adaptability to small datasets, aligning well with real-world tasks.
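The transfer-learning setup can be summarized by the following PyTorch sketch, where CrackScopeNet, the checkpoint path, and OzgenelCrackDataset are placeholders for the actual implementation artifacts (not defined here), and the optimizer choice and learning rate are illustrative assumptions rather than the exact training configuration.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader

# Placeholder names: the model class, checkpoint file, and dataset class are
# assumed to exist in the project; they are not defined in this sketch.
model = CrackScopeNet(num_classes=2)
model.load_state_dict(torch.load("crackscopenet_crackseg9k.pth",
                                 map_location="cpu"))  # CrackSeg9k weights

train_loader = DataLoader(OzgenelCrackDataset(split="train"),
                          batch_size=8, shuffle=True)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)  # assumed settings
criterion = nn.CrossEntropyLoss()

model.train()
for epoch in range(20):                  # fine-tune for only 20 epochs
    for images, masks in train_loader:   # masks: (N, H, W) class indices
        optimizer.zero_grad()
        logits = model(images)           # (N, 2, H, W)
        loss = criterion(logits, masks)
        loss.backward()
        optimizer.step()
```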
5.3. Diagnostic Experiments
To gain more insight into CrackScopeNet, we conducted a set of ablation studies on our proposed model. For efficiency, all methods mentioned in this section were trained with the same parameters for 200 epochs.
Stripwise Context Attention. First, we examined the role of the critical SWA module in CrackScopeNet by replacing it with two advanced attention mechanisms, CBAM [37] and CA [38]. The results shown in Table 6 demonstrate that without any attention mechanism, merely stacking convolutional neural networks for feature extraction yields poor performance due to the limited receptive field. When the SWA attention mechanism based on strip pooling and one-dimensional convolution was adopted, the network was able to capture long-range contextual information; under this configuration, the model exhibited the best performance.
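To make the mechanism concrete, below is a minimal PyTorch sketch of a strip-pooling attention block in the spirit of SWA: horizontal and vertical strip pooling followed by one-dimensional convolutions, applied as an attention map without compressing the channel dimension. The kernel size and the way the two directions are combined are assumptions and do not reproduce the exact SWA module.

```python
import torch
import torch.nn as nn

class StripAttention(nn.Module):
    """Illustrative strip-pooling attention: pool along one spatial axis,
    refine with a 1D convolution along the other, and gate the input.
    This is a sketch, not the exact SWA implementation."""
    def __init__(self, channels: int, k: int = 7):
        super().__init__()
        # Depthwise 1D convolutions keep the channel count unchanged.
        self.conv_h = nn.Conv1d(channels, channels, k, padding=k // 2, groups=channels)
        self.conv_w = nn.Conv1d(channels, channels, k, padding=k // 2, groups=channels)
        self.sigmoid = nn.Sigmoid()

    def forward(self, x):
        # Horizontal strip pooling: average over width -> (N, C, H)
        feat_h = self.conv_h(x.mean(dim=3))
        # Vertical strip pooling: average over height -> (N, C, W)
        feat_w = self.conv_w(x.mean(dim=2))
        # Combine both directional contexts into an (N, C, H, W) attention map.
        attn = self.sigmoid(feat_h.unsqueeze(3) + feat_w.unsqueeze(2))
        return x * attn


if __name__ == "__main__":
    out = StripAttention(64)(torch.randn(2, 64, 50, 50))
    print(out.shape)  # torch.Size([2, 64, 50, 50])
```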
Figure 8 shows the class activation maps (CAM) [59] before the segmentation head of CrackScopeNet. It can be observed that the model without SWA is easily disturbed by shadows, whereas with the SWA module the model can focus on the global crack areas. Next, we sequentially replaced the SWA module with the channel-spatial feature-based CBAM attention mechanism and the coordinate attention (CA) mechanism, which also uses strip pooling. While the model parameter counts did not change significantly, the performance declined by 0.2% and 0.17%, respectively.
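One simple way to produce such activation maps is to hook the feature map feeding the segmentation head, average it over channels, and upsample it to the input resolution; the sketch below illustrates this with a forward hook. The hooked layer is a placeholder, and this channel-averaging view is only an approximation of the CAM technique of [59].

```python
import torch
import torch.nn.functional as F

def activation_map(model, layer, image):
    """Average the channels of an intermediate feature map and upsample it,
    giving a coarse view of where the network responds before the seg head.
    `layer` is whatever module feeds the segmentation head (placeholder)."""
    feats = {}
    handle = layer.register_forward_hook(
        lambda m, inp, out: feats.update(value=out.detach()))
    with torch.no_grad():
        model(image)                       # image: (1, 3, H, W)
    handle.remove()
    cam = feats["value"].mean(dim=1, keepdim=True)             # (1, 1, h, w)
    cam = F.interpolate(cam, size=image.shape[-2:],
                        mode="bilinear", align_corners=False)
    cam = (cam - cam.min()) / (cam.max() - cam.min() + 1e-6)   # normalize to [0, 1]
    return cam[0, 0]                       # (H, W) heatmap
```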
Furthermore, we explored the benefits of the different attention mechanisms for other models by optimizing the advanced HrSegNetB48 lightweight crack segmentation network [29]. HrSegNetB48 consists of high-resolution and auxiliary branches, merging shallow detail information with deep semantic information at each stage. Therefore, we added the SWA, CBAM, and CA attention mechanisms after feature fusion to capture richer features.
Table 7 shows the performance of HrSegNetB48 with the different attention mechanisms, clearly indicating that introducing the SWA attention mechanism to capture long-range contextual information provides the most significant benefit.
Multiscale Branch. Next, we examined the effect of the multiscale branch in the CrackScope module. To ensure fairness, we replaced the multiscale branch with a convolution of larger kernel size (5 × 5 instead of 3 × 3). The results with and without the multiscale branch are shown in Table 6. It is evident that using a 5 × 5 convolution instead of the multiscale branch decreases the mIoU score (−0.16%) despite requiring more floating-point computations. This demonstrates that blindly adopting large kernel convolutions increases computational overhead without significant performance improvements. The benefits conferred by the multiscale branch were further analyzed through the CAM. As shown in the third column of Figure 8, when the multiscale branch is not used the network misses the feature information of small cracks, while the model with this branch captures the features of cracks with various shapes and sizes.
Decoder. CrackScopeNet uses a simple decoder to fuse feature information of different scales, compressing the channel features and merging features from different stages. Currently, the most popular decoders use an atrous spatial pyramid pooling (ASPP) [23] module to introduce multiscale information. To explore whether introducing an ASPP module could benefit our model, and to investigate the effectiveness of our proposed lightweight decoder, we replaced our decoder with the ASPP method adopted by DeepLabV3+ [23]. The results are shown in the last two rows of Table 6. The computational overhead is large because of the need to perform parallel dilated convolution operations on the deep semantic features; however, the performance of the model does not improve. This shows that using multiple sets of dilated convolutions to capture multiscale features incurs additional computational overhead while not contributing to the performance of our model.
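For context, the ASPP design referred to above applies several dilated convolutions in parallel on the deepest feature map; the minimal sketch below (with dilation rates 1, 6, 12, and 18, the common DeepLabV3+ choice assumed here) makes the extra cost apparent, since every branch runs over the full deep feature map.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ASPP(nn.Module):
    """Minimal ASPP sketch: parallel dilated 3x3 convolutions plus image-level
    pooling, concatenated and projected. Dilation rates are the common
    DeepLabV3+ defaults, assumed here for illustration."""
    def __init__(self, in_ch: int, out_ch: int, rates=(1, 6, 12, 18)):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Conv2d(in_ch, out_ch, 3, padding=r, dilation=r) for r in rates
        ])
        self.image_pool = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Conv2d(in_ch, out_ch, 1))
        self.project = nn.Conv2d(out_ch * (len(rates) + 1), out_ch, 1)

    def forward(self, x):
        h, w = x.shape[-2:]
        feats = [branch(x) for branch in self.branches]
        pooled = F.interpolate(self.image_pool(x), size=(h, w),
                               mode="bilinear", align_corners=False)
        return self.project(torch.cat(feats + [pooled], dim=1))
```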
5.4. Experiment Conclusions
Based on the comparative experiments conducted in previous sections, CrackScopeNet demonstrates significant advantages over both classical and lightweight semantic segmentation models in terms of performance, parameter count, and FLOPs. On the composite CrackSeg9k dataset, CrackScopeNet achieves high segmentation accuracy and shows excellent transferability to specific scenarios. Notably, it maintains a low parameter count and minimal FLOPs, which translates to low-latency inference on resource-constrained drone platforms without the need for GPU acceleration. This efficiency is achieved by taking crack morphology into account, allowing CrackScopeNet to remain lightweight and computationally efficient and making it particularly suitable for deployment on mobile devices in outdoor environments. In summary, CrackScopeNet achieves a better balance between segmentation accuracy and inference speed than the other networks examined in this study, making it a promising solution for timely crack detection and analysis on infrastructure surfaces using drones.
However, there remain some drawbacks in this study. The inference speed and performance were not tested on other hardware platforms, such as the Snapdragon 865, which may offer different computational capabilities. Additionally, our study did not explore the potential acceleration that NPUs (Neural Processing Units) or GPUs could provide. Further investigation into how these processing units can be fully utilized could offer significant improvements in the efficiency and performance of the model.
6. Discussion
In this paper, we present CrackScopeNet, a lightweight infrastructure surface crack segmentation network specifically designed to address the challenges posed by varying crack sizes, irregular contours, and subtle differences between cracks and normal regions in real-world applications. The proposed network structure captures the local context information and long-distance dependencies of cracks through a lightweight multiscale branch and an SWA attention mechanism, respectively, and effectively extracts the low-level details and high-level semantic information required for accurate crack segmentation.
In this work, we find that using channel-wise partitioning to apply different kernel sizes effectively captures multiscale features without introducing significant computational overhead. Additionally, by incorporating an attention mechanism that accounts for long-range dependencies, it is possible to compensate for the limitations of downsampling without resorting to additional detail branches, which would otherwise increase computational demands. Our experimental results demonstrate that CrackScopeNet delivers robust performance and high accuracy. It outperforms larger models like SegFormer in terms of efficiency, significantly reducing the number of parameters and computational cost. Furthermore, our method achieves faster inference speeds than other lightweight models such as BiSeNet and STDC even in the absence of GPU acceleration. This performance makes it highly suitable for deployment on resource-constrained drone platforms, enabling efficient and low-latency crack detection in structural health monitoring. By making the model and code publicly available, we aim to advance the application of UAV remote sensing technology in infrastructure maintenance, providing an efficient and practical tool for the timely detection and analysis of cracks.
Furthermore, utilizing UAVs to monitor crack development in geological disaster scenarios can greatly aid early warning efforts. CrackScopeNet, having proven effective in infrastructure crack detection, has the potential to be adapted to these contexts through domain adaptation. We have undertaken preliminary investigations by capturing images of hazardous rock formations with UAVs and using our model to extract crack regions, as illustrated in Figure 9. These environments present more intricate crack patterns, including various damage types and complex curved geometries. Our approach currently exhibits limitations in detecting fine cracks, particularly those that blend into the background. Our future work will focus on enhancing the model's sensitivity and capacity in order to accurately identify smaller and more complex crack patterns in challenging conditions, especially in geological disaster monitoring.
Lastly, in this era of large models, our model has only been trained and evaluated on datasets containing a few thousand images; the need for large-scale data collection and manual labeling remains a bottleneck. Recent advances in generative AI and self-supervised learning could bypass the limitations imposed by data acquisition and manual annotation. Researchers can use the inherent structure or attributes of existing data to generate richer "synthetic images" and "synthetic labels", a promising research avenue that could be applied to crack detection.