1. Introduction
In practical engineering applications, the service life of metal structural components usually directly determines whether engineering equipment can reliably, stably, and efficiently complete construction tasks. Since the emergence and expansion of micro-cracks typically harm the metal materials’ strength, indirectly shortening the service life of critical structural components and even the equipment itself, it is necessary to identify the early emergence of micro-cracks and repair or replace the affected parts during the engineering equipment service process to reduce the economic losses caused by fatigue failure. Therefore, the efficient and accurate identification and detection of micro-cracks on the surface of metal structural components has significant application potential and research value.
In earlier years, manual visual inspection was mainly used to detect micro-cracks on the surface of metal structural components. However, limited by factors such as the skill of the inspectors and the lighting conditions at the site, it is difficult to ensure the accuracy and speed of inspection using this method. Subsequently, many scholars proposed a variety of sensor-based micro-crack detection methods, such as ultrasonic, eddy current, and acoustic emission methods. However, these are subject to certain limitations, such as the object's material and surface topography, so sensor-based crack detection methods generally lack flexibility, making it challenging to improve detection accuracy, speed, and adaptability.
Compared to various sensors, using machine vision methods to acquire crack images and perform a series of processing tasks can circumvent some of these constraints. The core of traditional machine vision-based algorithms for small target detection lies in using feature extractors for image feature extraction, which involves three main steps: selecting candidate target regions, extracting features, and designing target classifiers. Many scholars have proposed optimization methods built around these three steps, such as more flexible sliding windows [1], an extraction method based on multi-scale features [2], and the better-performing classification method XGBoost [3] (XGBoost, where X stands for eXtreme, is a scalable machine learning system for tree boosting and an improved version of GBDT (Gradient Boosting Decision Tree); it is widely used because it can train models faster and more efficiently). However, even after nearly 20 years of development, traditional methods still have specific weaknesses: using sliding windows to select target regions produces many redundant windows, making the overall process complex, and the sliding windows lack scale flexibility and robustness. In addition, features based on low-level visual information are difficult to adapt to complex and changeable scenes, so the efficiency and accuracy of traditional methods on small target detection tasks can hardly be further improved, which restricts their promotion and application.
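To make the window redundancy concrete, the number of candidates produced by a multi-scale sliding-window scan can be estimated with a short sketch (the image size, window scales, and stride below are illustrative assumptions, not values from the cited works):

```python
def count_windows(img_w, img_h, win_sizes, stride):
    """Count candidate windows a multi-scale sliding-window scan produces."""
    total = 0
    for w, h in win_sizes:
        nx = (img_w - w) // stride + 1  # horizontal placements
        ny = (img_h - h) // stride + 1  # vertical placements
        if nx > 0 and ny > 0:
            total += nx * ny
    return total

# A 640x480 image scanned at three window scales with stride 8
n = count_windows(640, 480, [(32, 32), (64, 64), (128, 128)], stride=8)
print(n)  # 11183 candidate windows, most of them redundant
```

Even this modest configuration yields over ten thousand windows per image, each of which must be featurized and classified, which is why the sliding-window pipeline is considered complex.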
Several researchers have combined machine vision with traditional analytical methods to investigate the detection of micro-cracks. Landstrom et al. [4] proposed an automated online crack detection method for steel plate surfaces based on morphological image processing and statistical classification using logistic regression, which successfully identified more than 70% of the manually labeled crack lengths, missing only a few crack regions that contained shorter crack segments. Cubero-Fernandez et al. [5] proposed a single-step crack detection and classification method based on logarithmic transformation, a bilateral filter, the Canny operator, and morphological filtering to achieve an automated pavement crack detection system free from manual operation.
In recent years, with the rapid improvement in the performance of CPUs, GPUs, and other computing units, as well as camera CMOSs, optical imaging lenses, and other imaging hardware, both the field of computer vision and that of deep learning have witnessed rapid development. The traditional target detection task has gained a new direction and ideas for exploration.
Krizhevsky et al. [6] proposed the deep convolutional neural network AlexNet, trained on the ImageNet dataset, which achieved breakthrough results in the ImageNet Large Scale Visual Recognition Challenge (ILSVRC). The successful application of AlexNet in classification tasks became a starting point, and since then deep learning has been widely applied and developed in other visual tasks.
Target detection technology based on deep learning can use a multi-layered network model and powerful training algorithms to adaptively learn the high-level semantic information of images. By extracting image features and feeding them to a classification network to complete target classification and localization, the method effectively improves the accuracy and efficiency of target detection, so it has become a research hotspot for many scholars.
According to whether there is a candidate box generation step in the target detection process, the target detection algorithms can be divided into two-stage and single-stage detection algorithms, where the steps of the two-stage detection algorithm include (1) generating candidate regions and (2) determining candidate region categories and fine-tuning the bounding box. The single-stage detection algorithm discards the candidate region generation stage and directly defaults to all locations on the image as potential candidate regions and attempts to categorize each region of interest as either a background or target object.
Through the continuous research of scholars, many excellent target detection models have emerged. Two-stage algorithms include Fast R-CNN [7], Faster R-CNN [8], SPPNet [9], R-FCN [10], etc.; their distinctive feature is higher detection accuracy at a slightly slower speed, which makes it difficult to handle detection tasks with strict real-time requirements. Single-stage detection algorithms include the YOLO [11] series, SSD [12], DSSD [13], RetinaNet [14], RefineNet [15], etc.; their significant feature is very high detection speed, but slightly lower detection accuracy compared to two-stage algorithms.
Although the above algorithms can identify and classify targets with high efficiency, they cannot distinguish different individuals within the same class of targets. Mask R-CNN, proposed by He et al. [16], extends Faster R-CNN with a branch for instance mask prediction that is parallel to the existing bounding box regression and classification branches, and can thus perform target detection and instance segmentation simultaneously. Furthermore, it adds an instance segmentation loss to the loss function, so Mask R-CNN achieves fast detection and good results in pose estimation and human keypoint detection. Mask R-CNN is one of the most commonly used image instance segmentation algorithms, and it is widely applied in defective target segmentation tasks on road surfaces [17], in industrial manufacturing [18], on bolt fasteners [19], on leather surfaces [20], etc. However, the above studies mainly focus on detection tasks whose targets have similar horizontal and vertical dimensions and prominent features, while the model still struggles with targets such as micro-cracks, whose horizontal and vertical dimensions differ greatly and whose features are easily lost.
In recent years, many scholars have carried out studies to improve the recognition performance of Mask R-CNN. Zhou et al. [21] proposed an improved Mask R-CNN model based on attention, rotation, and a genetic algorithm for detecting defects of damaged insulators in power equipment. By modifying the backbone, the model's sensitivity to small targets is significantly improved and their locations can be identified more quickly; meanwhile, the genetic algorithm is combined with gradient descent to optimize the model's hyperparameters so that the training results are as close to the global optimum as possible. Shen et al. [22] proposed an unsound wheat kernel recognition algorithm based on improved Mask R-CNN to meet the need for rapid wheat grading. By optimizing the structure of the FPN and RPN and adding an attention mechanism, the model identifies unsound wheat kernels faster and more accurately, with an accuracy of 86%, a recall of 91%, and a single-image inference time of 7.83 s. Aiming at the low efficiency and accuracy of manual maintenance of railway switches, Wei et al. [23] proposed a low-complexity accurate ranging algorithm based on Mask R-CNN. The region of interest is segmented twice through an interactive iterative method, the image distortion is corrected according to the vertex mapping principle, and the actual distance is calculated accurately by fitting a linear distance transformation equation, finally achieving accurate measurement of the distance between different working parts.
At the same time, Mask R-CNN is well suited to the identification, detection, and analysis of specific small target objects. To meet the statistical requirement of determining the number of air conditioners in use, Yang et al. [24] established a feature detection dataset of external air conditioner units, proposed an automatic search algorithm for urban air conditioners based on Mask R-CNN and YOLOv5, and explored the feasibility of using street view images to count air conditioners. To meet the demand for real-time detection and recognition of sows in large-scale breeding, Lei et al. [25] established a multi-objective sow detection and recognition model based on improved Mask R-CNN and UNet-Attention deep learning algorithms, which can identify the contours of sows and analyze their area distribution in the pen. It can also recognize sow behaviors such as eating, drinking, and lying down, with a final recognition rate of 96.8%.
In recent years, some scholars have also explored new methods for crack detection. Wang et al. [26] presented a SIFT matching method based on an alternate-selection strategy, solving the problems of local information loss and reduced refinement capacity that are frequently encountered in deep learning crack detection algorithms. Luo et al. [27] reviewed three significant aspects of CV-based methods, including surface defect detection, vibration measurement, and vehicle parameter identification, aiming to provide guidance for selecting appropriate CV-based methods for bridge inspection and monitoring. Computer vision and deep learning methods give crack detection more space to explore.
Research on traditional and deep learning-based target detection methods shows that, although current mainstream algorithms achieve a good balance between detection speed and accuracy, specific problems remain for tiny targets: high-quality open datasets of tiny target defect samples are lacking; tiny target features are difficult to extract accurately and tend to be submerged in complex backgrounds and textures; and tiny target detection accuracy still has room for improvement. Therefore, this paper focuses on these difficulties and takes micro-cracks on the surface of metal structural parts as the research object. Micro-crack images are collected in advance to form a micro-crack dataset of a certain scale, data enhancement is carried out, and optimization research on deep learning-based micro-crack detection algorithms is conducted, to provide methodological support for improving the accuracy of detecting micro-cracks on the surface of metal structural parts under different working conditions.
3. Improvement of Mask R-CNN Network
With the addition of FPN and ROI Align, Mask R-CNN achieves high instance segmentation accuracy on routine target detection tasks. However, the backbone of traditional Mask R-CNN often performs poorly on micro-crack detection tasks: it tends to produce inaccurate bounding boxes in the detection stage and inaccurate segmentations in the target segmentation stage. Therefore, this section addresses these problems by optimizing and improving the Mask R-CNN structure to improve its feature extraction performance.
3.1. Improved FPN Model
The Backbone of Mask R-CNN uses the ResNet50/101 + FPN scheme. Although the FPN module combines deep and shallow features, it still has the problem of insufficient utilization of multi-scale features.
The original feature map transfer roadmap of FPN is shown in Figure 4a, where blue arrows represent the top-down feature fusion path. When dealing with the detection of small defect targets, a single feature fusion path can hardly make sufficient use of the underlying detail features, and the long connection path from the bottom layer to the top layer weakens those detail features, reducing feature extraction performance. Therefore, based on the original feature map transfer roadmap of FPN, two additional feature fusion paths are added in this section, as shown in Figure 4b, where the orange arrows indicate a bottom-up feature fusion path and the red arrows indicate an additional top-down feature fusion path.
In the original feature map transfer roadmap of FPN (Figure 4a), {P2, P3, P4, P5} is used as the set of effective feature layers from which the network obtains the suggestion boxes. In the improved roadmap of this section, however, each original layer Pi is used as an intermediate quantity and continues to be passed backward; we write the layers of the added bottom-up path as Ni, where N2 is obtained directly from P2. Firstly, Ni is adjusted to the same size as Pi+1 by a 3 × 3 convolution kernel with stride 2, and the result is added to Pi+1 pixel by pixel (the channel numbers of all Pi and Ni are 256) to obtain Ni+1, yielding the effective feature layers {N2, N3, N4, N5}. Then, along the additional top-down path, the deeper layer is 2× up-sampled to obtain an effective feature layer with higher resolution; when its size is consistent with that of the shallower layer Ni, the two are added together, and a 3 × 3 convolution is performed on the new fused feature layer to reduce the aliasing effect caused by up-sampling. The final new effective feature layers {M2, M3, M4, M5} are thus obtained (M5 is obtained from N5 directly through a 3 × 3 convolution operation, and M6 is obtained from M5 through maximum pooling).
Compared with the original FPN effective feature layers, the new effective feature layers obtained through the improved feature map transfer roadmap have richer semantic and detailed information, which is conducive to improving the detection and location accuracy of surface micro-cracks.
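The two added fusion paths can be sketched in NumPy on toy single-channel feature maps. The stride-2 3 × 3 convolution and the smoothing convolution are stood in for here by 2 × 2 average pooling and nearest-neighbour up-sampling, so the sketch only illustrates the flow of information between pyramid levels, not the learned operations:

```python
import numpy as np

def downsample2(x):
    # stand-in for the stride-2 3x3 convolution: 2x2 average pooling
    h, w = x.shape
    return x.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))

def upsample2(x):
    # stand-in for 2x nearest-neighbour up-sampling
    return np.kron(x, np.ones((2, 2)))

# Toy FPN outputs P2..P5 (highest resolution first)
P = [np.random.rand(32, 32), np.random.rand(16, 16),
     np.random.rand(8, 8), np.random.rand(4, 4)]

# Added bottom-up path (orange arrows): N2 = P2, N_{i+1} = down(N_i) + P_{i+1}
N = [P[0]]
for i in range(1, len(P)):
    N.append(downsample2(N[-1]) + P[i])

# Added top-down path (red arrows): M5 = N5, M_i = up(M_{i+1}) + N_i
M = [None] * len(N)
M[-1] = N[-1]
for i in range(len(N) - 2, -1, -1):
    M[i] = upsample2(M[i + 1]) + N[i]

print([m.shape for m in M])  # [(32, 32), (16, 16), (8, 8), (4, 4)]
```

Each final layer Mi thus aggregates detail features carried up from the bottom of the pyramid and semantic features carried back down, at its original resolution.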
3.2. Improved ResNet Model
For micro-defect detection tasks, it is usually necessary to extract and combine feature information and context information at different scales to identify target objects, so perceiving information at different scales is crucial for target classification and semantic segmentation. A micro-crack's context often occupies a larger area than the micro-crack itself; for example, when a micro-crack exists on a slab, the network can better judge whether the abnormal area is a micro-crack from the context of the large-scale slab.
To enhance the capability of the feature extraction network at different scales, this section draws on the parallel network idea of GoogLeNet to construct hierarchical residual connections within a single residual block of the ResNet, so that a residual block can realize the feature extraction of multiple receptor fields.
Figure 5a shows the residual module in ResNet, and Figure 5b shows the improved hierarchical residual module proposed in this section.
It can be seen from Figure 5b that this section replaces the original 3 × 3 convolution block with a set of smaller convolution blocks. After the input feature map is convolved by 1 × 1, its channel dimension changes and it is divided into s groups (where s = 4), with each group of input represented by xi (i = 1, 2, …, s). The 3 × 3 convolution of group i is represented by Ki, and the output of group i after Ki is represented by yi. Here, x1 directly yields y1 without undergoing a 3 × 3 convolution, and x2 yields y2 after undergoing the 3 × 3 convolution K2. At the same time, a local residual structure is added in the process of obtaining yi from xi and yi−1 in order to increase the number of scales that the output features can express: xi is added to the output yi−1 of the previous group before being fed into Ki. The above process can be shown as Equation (3):
yi = xi, i = 1; yi = Ki(xi), i = 2; yi = Ki(xi + yi−1), 2 < i ≤ s.
The output can be obtained by splicing the outputs of the s parts along the channel dimension. The channel dimension is then raised by a 1 × 1 convolution layer, the result is added to the residual branch, and the final output is obtained with the ReLU activation function. By establishing hierarchical residual connections within residual blocks, feature information at multiple scales can be obtained within one residual block, as shown in Equations (4)–(7), where * represents the convolution operation.
Each Ki receives the output features of the previous group's convolution, resulting in a larger receptive field than the original 3 × 3 convolution operation, and the output of the hierarchical residual structure contains context information at more scales due to this combination effect.
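The hierarchical splitting of Equation (3) can be sketched as follows. A 3 × 3 box filter stands in for each learned convolution Ki, and each channel group is simplified to a single 2D map, so the sketch shows the group-wise data flow rather than the trained block:

```python
import numpy as np

def conv3x3(x):
    # stand-in for a learned 3x3 convolution: a 3x3 box filter with zero padding
    p = np.pad(x, 1)
    return sum(p[i:i + x.shape[0], j:j + x.shape[1]]
               for i in range(3) for j in range(3)) / 9.0

def hierarchical_residual(groups):
    """Equation (3): y1 = x1, y2 = K2(x2), yi = Ki(xi + y_{i-1}) for 2 < i <= s."""
    y = [groups[0]]                      # y1 = x1, passed through directly
    y.append(conv3x3(groups[1]))         # y2 = K2(x2)
    for x_i in groups[2:]:
        y.append(conv3x3(x_i + y[-1]))   # yi = Ki(xi + y_{i-1}), local residual
    return np.stack(y)                   # splice the s outputs channel-wise

# Input feature map divided into s = 4 channel groups of 8x8 each
x = [np.random.rand(8, 8) for _ in range(4)]
out = hierarchical_residual(x)
print(out.shape)  # (4, 8, 8)
```

Because group i receives the filtered output of group i − 1, its effective receptive field grows with i, which is how one block comes to cover multiple scales.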
3.3. Deformable Convolution Model
In mathematics, the standard definition of convolution is the integral of the product of two functions after one is reversed and shifted, as shown in Equation (8). Usually, the function f is referred to as the filter, and the function g is referred to as the raw data of the signal or image.
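The discrete counterpart of this definition can be checked numerically: the flip-and-shift sum below, written out by hand, reproduces NumPy's `np.convolve` (the example filter and signal are arbitrary illustrative values):

```python
import numpy as np

f = np.array([1.0, 2.0, 3.0])        # filter
g = np.array([0.0, 1.0, 0.5, 0.25])  # raw signal data

# Discrete form of the definition: (f * g)[n] = sum_k f[k] * g[n - k]
n_out = len(f) + len(g) - 1
manual = np.zeros(n_out)
for n in range(n_out):
    for k in range(len(f)):
        if 0 <= n - k < len(g):
            manual[n] += f[k] * g[n - k]

print(np.allclose(manual, np.convolve(f, g)))  # True
```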
In a CNN, the convolution kernel is essentially a filter. When processing images, the convolution kernel computes a weighted average of the pixel values in a small region of the input image, and the result becomes the corresponding pixel value of the output. By setting up convolution kernels of different forms, different local features can be extracted and many neurons generated, and a convolutional neural network can be constructed by connecting such layers deeply.
In general, a larger convolution kernel obtains a larger receptive field, meaning more picture information can be seen and more global features can be obtained. However, a large convolution kernel will lead to a significant increase in computation and a decrease in performance. The traditional convolution kernel size is usually a standard rectangle, such as 1 × 1, 3 × 3, 5 × 5, etc. In the actual target detection task, the convolution kernel size can be adjusted according to the target characteristics. However, when the target shape changes, or the shape is irregular (for example, when the length and width of the target are very different, such as micro-cracks), the traditional rectangular convolution kernel easily misses the target features during the convolution process, which reduces the feature extraction performance. Therefore, a deformable convolution kernel is added in this section to replace the rectangular convolution kernel in the original Mask R-CNN to improve the feature extraction performance for small targets.
The idea of deformable convolutional networks was proposed by Dai et al. [30] in an ICCV 2017 article, in which the authors proposed a deformable convolutional network called DCNv1. The core idea is to add offset parameters to the traditional convolution kernel and learn the kernel weights and offsets simultaneously. The calculation process from the standard convolution kernel to the deformable convolution kernel can be shown as Equations (9) and (10), where R is the set of sampling locations of the kernel on the input feature map, p0 is the center point in the feature map, pn is a sampling point in R, w is the weight of the convolution kernel, and Δpn is the learned offset of point pn. The schematic diagram is shown in Figure 6. The deformable convolution kernel of DCNv1 provides a freer receptive field, can easily replace standard modules in existing CNNs, and demonstrates the effectiveness of learning dense spatial transformations for complex visual tasks in deep convolutional neural networks.
However, the offset module in DCNv1 generates a large amount of context-free information, which is unfavorable for small target detection tasks. Wang et al. [31] proposed DCNv2, which adds more deformable convolution layers and allows the model to learn not only the offset but also a weight for each sampling point, effectively reducing the interference of irrelevant information. The calculation process of DCNv2 from input to output can be expressed by Equation (11), where Δmn is the weight coefficient added on the basis of Equation (10), with a value range of [0,1].
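A minimal sketch of the DCNv2 sampling rule of Equation (11), for a single-channel feature map and one output location of a 3 × 3 kernel. The offsets Δpn and modulation weights Δmn are supplied by hand here, whereas in DCNv2 they are predicted by an extra convolution branch; fractional sampling locations are resolved by bilinear interpolation, as in the original papers:

```python
import numpy as np

def bilinear(x, py, px):
    """Bilinearly sample feature map x at fractional location (py, px)."""
    h, w = x.shape
    y0, x0 = int(np.floor(py)), int(np.floor(px))
    val = 0.0
    for dy in (0, 1):
        for dx in (0, 1):
            yy, xx = y0 + dy, x0 + dx
            if 0 <= yy < h and 0 <= xx < w:
                val += x[yy, xx] * (1 - abs(py - yy)) * (1 - abs(px - xx))
    return val

def deform_conv_point(x, w, p0, offsets, mods):
    """Equation (11): y(p0) = sum_n w(pn) * x(p0 + pn + dpn) * dmn."""
    grid = [(dy, dx) for dy in (-1, 0, 1) for dx in (-1, 0, 1)]  # 3x3 kernel R
    y = 0.0
    for n, (dy, dx) in enumerate(grid):
        py = p0[0] + dy + offsets[n][0]
        px = p0[1] + dx + offsets[n][1]
        y += w[n] * bilinear(x, py, px) * mods[n]
    return y

x = np.arange(25, dtype=float).reshape(5, 5)
w = np.full(9, 1.0 / 9.0)           # uniform kernel weights
zero = [(0.0, 0.0)] * 9             # no deformation
ones = [1.0] * 9                    # unit modulation
# With zero offsets and unit modulation this reduces to a standard 3x3 conv
print(deform_conv_point(x, w, (2, 2), zero, ones))  # 12.0 (mean of 3x3 patch)
```

Setting nonzero offsets lets the nine samples drift off the rectangular grid, which is exactly what allows the kernel to follow a slender crack.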
Based on the modification of the residual structure of ResNet in Section 3.2, the standard 3 × 3 convolution in the hierarchical residual structure is replaced with deformable convolution, as shown in Figure 7a,b.
In DCNv2, all 3 × 3 convolution kernels from conv_3 to conv_5 of the ResNet are replaced with deformable convolutions, which achieves a good performance improvement on the COCO dataset. However, as network depth increases, the irregularity of target features gradually decreases during layer-by-layer convolution. Therefore, weighing the increased computation against the performance gain of additional deformable layers, this section only replaces the 3 × 3 convolutions in conv_2 and conv_3 of the ResNet in Mask R-CNN with deformable convolutions, keeping conv_1, conv_4, and conv_5 unchanged. The improved ResNet structure is shown in
Figure 7c, where Res-DCN represents a hierarchical residual module with deformable convolution added.
3.4. Attention Mechanisms Model
In the micro-defect target detection task, micro-cracks account for a small proportion of pixels in the image, and are usually elongated in shape, with weak visual features, and easily interfered with by complex backgrounds. The effective feature information that can be extracted from images in deep convolutional networks is very limited and is more likely to be lost. Therefore, this section adds an attention mechanism based on the original Mask R-CNN to enhance the network’s target attention to small defects and to appropriately weaken its attention to other information.
The addition of attention mechanisms in neural networks can flexibly adjust the proportion of weight value to allocate computing resources to more critical tasks and reduce the attention paid to other information in the case of limited computing power in order to improve the efficiency and accuracy of task processing.
Mainstream attention mechanism models mainly include channel attention, spatial attention, the CBAM attention mechanism, etc. The channel attention mechanism allocates different attention weights according to the differing importance of different channels, while the spatial attention mechanism allocates different attention weights according to the differing importance of different regions. The purpose of both is to allocate the model's computing resources rationally and achieve as much effect as possible with limited resources.
The CBAM module is a hybrid attention mechanism building on SENet that combines the advantages of the channel and spatial attention mechanisms. It computes attention feature information sequentially along the channel and spatial dimensions, multiplies the two attention maps with the original input feature, and generates the final feature after adaptive feature refinement. CBAM is a lightweight module that can be embedded into any backbone network to improve performance, especially when the features of different parts differ greatly; it is designed to enhance the ability of convolutional neural networks to focus on informative regions of images.
In this section, two attention module embedding schemes are set up to explore the effect of attention mechanism on improving network performance. The schematic diagram of the two schemes is shown in
Figure 8. For scheme (a), the CBAM module is arranged after the first Max Pooling layer of the ResNet structure. For scheme (b), the CBAM module is embedded between the third and the fourth convolutional block of the ResNet structure. The above two different embedding schemes are intended to compare the influence of embedding attention mechanism modules in different locations of ResNet on feature extraction ability.
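The channel-then-spatial refinement performed by CBAM can be sketched as follows. The learned shared MLP of the channel branch is stood in for by a fixed identity weight matrix, and the learned 7 × 7 convolution of the spatial branch by a simple average of the two pooled maps, so only the data flow, not the trained module, is illustrated:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def channel_attention(x, w):
    """Channel branch: shared MLP over avg- and max-pooled channel descriptors."""
    avg = x.mean(axis=(1, 2))            # (C,) average-pooled descriptor
    mx = x.max(axis=(1, 2))              # (C,) max-pooled descriptor
    att = sigmoid(w @ avg + w @ mx)      # single-layer stand-in for the MLP
    return x * att[:, None, None]

def spatial_attention(x):
    """Spatial branch: channel-wise avg and max maps, fused by averaging here
    (the real CBAM fuses them with a learned 7x7 convolution)."""
    avg = x.mean(axis=0)
    mx = x.max(axis=0)
    att = sigmoid((avg + mx) / 2.0)
    return x * att[None, :, :]

rng = np.random.default_rng(0)
x = rng.random((4, 8, 8))                # toy (C, H, W) feature map
w = np.eye(4)                            # identity "MLP" weights (assumption)
out = spatial_attention(channel_attention(x, w))
print(out.shape)  # (4, 8, 8)
```

Since both attention maps are sigmoid-valued, the module reweights the input feature map without changing its shape, which is why it can be dropped in after any ResNet stage in either scheme (a) or scheme (b).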
3.5. Improved Loss Function
Smooth L1 Loss, the traditional Mask R-CNN loss function, assumes that the four coordinates of the bounding box are independent of each other when calculating the bounding box regression loss: it computes the loss value of each coordinate and sums them to obtain the final regression loss. However, these four coordinates are interrelated, and the metric used in actual evaluation is based on the intersection over union (IOU) between the predicted box and the real box, which does not match the calculation method of Smooth L1 Loss. In addition, multiple detection boxes can have nearly equal loss values yet very different IOU values. Therefore, this section argues that designing the loss function around IOU is more in line with the idea of bounding box regression, and replaces the Smooth L1 Loss in the original Mask R-CNN with an IOU-based loss function.
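The mismatch can be demonstrated numerically: the two hypothetical predicted boxes below (coordinates chosen for illustration) have identical Smooth L1 regression losses against the same ground truth but different IOU values:

```python
import numpy as np

def smooth_l1(pred, gt):
    """Per-coordinate Smooth L1 loss, summed over the four box coordinates."""
    d = np.abs(np.asarray(pred, float) - np.asarray(gt, float))
    return np.where(d < 1.0, 0.5 * d * d, d - 0.5).sum()

def iou(a, b):
    """Intersection over union of boxes given as (x1, y1, x2, y2)."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / (area(a) + area(b) - inter)

gt = (0.0, 0.0, 10.0, 10.0)
pred_a = (1.0, 1.0, 11.0, 11.0)    # whole box shifted by 1
pred_b = (1.5, 1.5, 10.0, 10.0)    # two corners moved by 1.5

print(smooth_l1(pred_a, gt) == smooth_l1(pred_b, gt))  # True: both 2.0
print(iou(pred_a, gt) < iou(pred_b, gt))               # True: IOUs differ
```

A loss that ranks both predictions equally while the evaluation metric does not is exactly the inconsistency that motivates an IOU-based loss.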
Common IOU loss functions include IOU, GIOU, DIOU, and CIOU, as shown in Equations (12)–(15):
IOU = C/(A + B − C); GIOU = IOU − D/E; DIOU = IOU − d²/L²; CIOU = IOU − d²/L² − αν,
where A is the area of the real rectangular box, B is the area of the predicted rectangular box, C is the area of the intersection of A and B, and D is the area within the external rectangle of A and B that belongs to neither box. E is the area of the external rectangle of A and B, d is the Euclidean distance between the center points of A and B, L is the diagonal distance of the external rectangle E, α is the weight coefficient, and ν is the similarity index of the aspect ratio between the predicted box and the real box.
Compared with Smooth L1 Loss, IOU Loss establishes a box-based loss calculation, which directly reflects the agreement between the predicted box and the real box. However, when the predicted box and the real box have no intersection, that is, when C = 0, the loss function provides no gradient for optimization. Moreover, for the same IOU loss value, the predicted and real boxes can intersect in various ways. Hence, IOU Loss neither optimizes the case of non-intersecting boxes nor reflects how the predicted and real boxes intersect.
GIOU Loss adds the penalty term D/E on the basis of IOU Loss. By considering the non-intersecting area within the external rectangle, it can carry out learning and training even when there is no overlap between A and B. GIOU has several advantages: (a) GIOU is scale invariant; (b) GIOU is a lower bound of IOU, and when the two boxes overlap completely, IOU = GIOU = 1; (c) GIOU pays attention not only to overlapping areas but also to non-overlapping areas, which better reflects the degree of overlap between the two boxes.
DIOU improves on the problems of GIOU. As shown in Figure 9a, the ratio d²/L² of the squared Euclidean distance d between the center points of the predicted box and the real box to the squared diagonal length L of the external rectangle is introduced as a penalty term. When the IOU value is the same, this ratio can reflect how the predicted box and the real box intersect. When the two boxes do not intersect, as shown in Figure 9b, the IOU value is 0, and the penalty term d²/L² becomes the main optimization object, moving the predicted box toward the real box until they coincide. When one box contains the other, as shown in Figure 9c, the penalty term d²/L² remains the optimization object, guiding the center point of the predicted box toward that of the real box.
CIOU Loss adds the constraint term αν on the aspect ratio. Compared with DIOU Loss, its convergence is somewhat faster; however, since it involves the calculation of an inverse trigonometric function, the optimization speed of the model is reduced to a certain extent. The authors used the Faster R-CNN model to conduct experiments on the MS COCO 2017 dataset, and the results showed that the accuracy of CIOU Loss was slightly lower than that of DIOU Loss for detecting small-scale targets.
Therefore, after weighing optimization speed against performance on small target detection tasks, this section uses DIOU Loss to replace the bounding box regression loss Smooth L1 Loss in Mask R-CNN. Two modules in Mask R-CNN calculate the bounding box regression loss, namely the RPN network and the ROI Head module; both originally use Smooth L1 Loss, and this section replaces both with DIOU Loss.
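A minimal sketch of the IOU, GIOU, and DIOU metrics for axis-aligned boxes, illustrating why DIOU still provides a useful signal when plain IOU does not (the box coordinates are illustrative):

```python
def boxes_metrics(a, b):
    """IOU, GIOU, and DIOU for boxes given as (x1, y1, x2, y2)."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    iou = inter / union
    # smallest external rectangle E enclosing both boxes
    ex1, ey1 = min(a[0], b[0]), min(a[1], b[1])
    ex2, ey2 = max(a[2], b[2]), max(a[3], b[3])
    e_area = (ex2 - ex1) * (ey2 - ey1)
    giou = iou - (e_area - union) / e_area      # penalty term D/E
    # squared centre distance d^2 and squared diagonal L^2 of E
    d2 = ((a[0] + a[2]) / 2 - (b[0] + b[2]) / 2) ** 2 \
       + ((a[1] + a[3]) / 2 - (b[1] + b[3]) / 2) ** 2
    L2 = (ex2 - ex1) ** 2 + (ey2 - ey1) ** 2
    diou = iou - d2 / L2                        # penalty term d^2/L^2
    return iou, giou, diou

# Two disjoint predictions: IOU gives no signal, DIOU still penalises distance
near = boxes_metrics((0, 0, 2, 2), (3, 0, 5, 2))
far = boxes_metrics((0, 0, 2, 2), (8, 0, 10, 2))
print(near[0], far[0])    # 0.0 0.0 -- IOU cannot tell the two apart
print(near[2] > far[2])   # True    -- DIOU prefers the nearer prediction
```

This is the property exploited when replacing the Smooth L1 losses of the RPN and ROI Head: even for non-overlapping boxes, DIOU Loss yields a gradient that pulls the prediction toward the ground truth.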
5. Conclusions
Micro-cracks on the surfaces of metal structures have weak target features and are easily disturbed by complex backgrounds, so deep convolutional neural networks are prone to missing or falsely detecting them. Therefore, this paper first constructs a surface crack dataset, covering metal test samples and metal structural parts of large equipment, and enhances the dataset through noise addition, geometric transformation, and other operations. Then, based on the original Mask R-CNN, this paper explores an improved method for micro-crack detection tasks on the surface of metal structural parts.
In this paper, the network structure of the Mask R-CNN is improved: based on the original FPN structure, a bottom-up feature fusion path is added to improve the information utilization rate of the underlying feature layer. The original ResNet was changed to a hierarchical residual structure to improve the efficiency of extracting features at different scales. A deformable convolution kernel replaces the 3 × 3 convolution kernel in the original residual structure to improve the feature extraction efficiency of the target with small and slender cracks. The CBAM attention module is tentatively embedded in the head, tail, and middle parts of ResNet to increase the expression weight of the micro-crack region in the feature layer. The original Smooth L1 Loss function was replaced with DIOU Loss to optimize the network training effect. Finally, an ablation experiment was conducted to verify the effect of each improvement scheme on the performance improvement of the model.
The performance comparison results of various optimization models show that all the improvement schemes proposed in this paper improved the performance of the original Mask R-CNN, among which the improvement of FPN had the most significant effect on the recognition rate and accuracy of network detection. Replacing the original Smooth L1 Loss with DIOU Loss can significantly improve the convergence effect during network training. The integration of all the improvement schemes can produce the most significant performance improvement effect in the aspects of identification, classification, and positioning at the same time, which proves the rationality and feasibility of the improved scheme in this paper.
On the other hand, the research process of this paper still has some limitations that cannot be ignored. In order to make each image in the micro-crack dataset contain as many crack details as possible, so that the model could extract the features of micro-cracks more efficiently, this paper used a digital SLR camera to capture micro-crack images with higher resolution in the image acquisition stage. However, the issue of how to effectively extract features based on low-resolution micro-crack images and train the detection model to achieve similar performance to the model proposed in this paper (using high-resolution micro-crack images for training) has not been explored.
Furthermore, in order to verify the detection performance of the improved Mask R-CNN trained in this paper on relatively low-resolution micro-crack images, the resolution of the dataset used for training was uniformly reduced from 1920 × 1280 to 640 × 480 and imported into the same model for prediction and inference. As the defect feature information contained in the micro-crack targets decreases along with the resolution, more images produced false and missed detections compared with the results before the resolution reduction. This comparison also indicates that the improved method proposed in this paper still has room for optimization in terms of defect feature extraction performance.
At the same time, although the 500 original micro-crack images captured in this paper were enhanced to 2000 images via our data enhancement method, it is still difficult to support and train a micro-crack detection model with excellent performance. Therefore, it is very valuable to explore more efficient methods of enhancing small-scale datasets or model training based on small-scale datasets.