1. Introduction
As they address the most typical surface defects of bridges, crack detection and repair are key to bridge maintenance. With increasing pressure on bridge management and maintenance, traditional manual inspection is no longer adequate; its drawbacks include a high miss rate, low efficiency and high labor cost. Therefore, an efficient, accurate and economical alternative is urgently needed. Digital-image-based nondestructive crack detection was initially expected to replace manual inspection. However, it still cannot do so completely, because it is highly susceptible to noise and lacks the ability to extract deep semantic information from images. Additionally, its accuracy falls far short of actual engineering requirements when cracks have complex and diverse dimensional characteristics or when the detection environment is noisy. Dong et al. [1] pointed out that, although traditional image processing techniques and machine learning can identify various types of structural damage from images, the processing of image data is still time-consuming and prone to a high number of false positives. Ibrahim et al. [2] reported that a CNN outperformed SVM and KNN in damage-detection accuracy when evaluating the health condition of two simulated four- and eight-story building structures subjected to earthquakes. Hou et al. [3] pointed out that manual recognition methods and traditional machine vision methods are inefficient. Training samples, as well as specific information about the defects, such as contour and location information, are not fully available in different environments.
The rapid development of deep learning in recent years has provided new ideas for intelligent health monitoring of bridges. Deep learning has been widely used in various fields of computer vision, and its biggest advantage over traditional digital image methods is its ability to extract deep features of images. Consequently, after effective training, deep learning can resist the noise of various complex detection environments and significantly improve detection accuracy. Yeum et al. [4] integrated the SfM technique and a pre-trained CNN model to localize and classify the ROI of the same full-size road sign structure, which showed better performance than previous methods. Narazaki et al. [5] used an FCN and a recurrent structure (long short-term memory, LSTM) to automatically segment and label the different structural components of a bridge structure, working with synthetic video datasets. The above studies demonstrate that deep-learning-based computer vision techniques for bridge crack detection can effectively address the shortcomings of traditional digital image methods and computer vision techniques. Combined with UAV technology, deep learning can achieve nondestructive data collection of defects in bridge components. Processing the data collected by the UAV using deep learning can achieve automated crack identification. More importantly, deep learning can be combined with the GPS positioning module and the sensor information of the UAV to achieve accurate localization and dimension assessment of bridge crack defects.
The application of deep learning techniques in the field of crack detection is divided into object detection and segmentation. Object detection of cracks is performed in the form of rectangular boxes (bbox) used to locate cracks in a given image. Semantic segmentation indicates the category to which each pixel of the original image belongs. The output is a mask that reveals the category to which each pixel belongs. Instance segmentation distinguishes and counts the detected objects based on the semantic segmentation, and outputs a rectangular box (bbox) with the same meaning as in object detection, in addition to the mask. The core idea of the above tasks is to continuously extract the feature information of a given image through various network structures, which is, essentially, to perform matrix operations at multiple levels and finally obtain the coordinate data of the bbox or the mask data. The widely used algorithmic architectures for object detection include the YOLO [6] series, Faster RCNN [7] and DETR [8]; semantic segmentation architectures include Unet [9], DeepLab [10], etc.; instance segmentation architectures include Mask RCNN [11], Yolact [12], etc. The network models used by different algorithms for extracting features are Convolutional Neural Networks [13] (CNN) and Transformer [14].
Accordingly, many scholars have used the above-mentioned deep learning techniques to achieve automated detection of bridge defects. Xu et al. [15] constructed an end-to-end convolutional neural network with an embedded ASPP structure to achieve object detection of bridge cracks, and the F1 score of the model reached 87.71%. Shim et al. [16] generated multiscale feature maps in the encoder of a traditional semantic segmentation network and used adversarial learning to complete the semantic segmentation of bridge cracks, achieving an F1 score of 88.936%. Liu et al. [17] proposed APLCNet, based on Mask RCNN, for sidewalk crack detection, with an F1 score of 93.53% on the CFD dataset [18]. Ren et al. [19] proposed CrackSeg for segmentation of sidewalk cracks. In the same year, Jang et al. [20] improved CrackSeg and applied it to the detection of bridge pier cracks; its accuracy for crack segmentation reached 90.92% and its recall rate reached 97.47%. Liu et al. also improved CrackSeg, using the ResNeXt [21] backbone network, and proposed PCSN for bridge crack segmentation, with an accuracy of 83%. In addition, research on the interpretability of deep learning in bridge defect detection has also made new progress. Cardellicchio et al. [22] proposed DL-based methods for identifying defects in reinforced concrete bridges, offering human-interpretable explanations using XAI techniques such as CAMs. In terms of datasets, Zou et al. [23] published a high-quality crack dataset, the DeepCrack dataset, and proposed DeepCrack, an end-to-end deep convolutional neural network for automatic crack detection. DeepCrack outperforms state-of-the-art methods, achieving an average F-measure of over 0.87 on three challenging datasets. Yang et al. [24] created a pavement crack dataset of 500 images, named CRACK500. They also proposed FPHBN, a novel network architecture for automatic pavement crack detection. In terms of crack geometric parameter extraction, Tang et al. [25] developed a novel method to detect and quantify cracks with high accuracy and efficiency. Their method uses a U-net network and a thinning algorithm to extract the geometric parameters of cracks. Kao et al. [26] first used YOLO to locate bridge cracks, and then applied a digital image method for edge detection within the located area to measure the crack dimensions. Teng et al. [27] used DeepLab to perform semantic segmentation on crack images, which can better describe key dimension information such as the length and width of the cracks.
Although the algorithms and datasets for crack segmentation have been initially effective in bridge crack detection, they are mainly limited to the semantic segmentation of cracks, which cannot satisfy the need to distinguish different cracks. As mentioned above, although the improvement of CrackSeg substantially improves the segmentation accuracy of bridge cracks, its semantic segmentation characteristics cannot localize the intersecting longitudinal cracks that mainly affect bridge safety. Liu et al. proposed APLCNet [17] on the basis of Mask RCNN; although it completes the segmentation and localization of cracks, its dataset rarely involves complex intersecting cracks. Therefore, further research is urgently needed on instance segmentation of complex cracks with interlaced vertical and horizontal dimensions, in terms of both data and algorithms. The CRACK500 [24] and DeepCrack [23] datasets are both used for semantic segmentation. In contrast, there is a relative lack of research on crack instance segmentation, especially regarding datasets. Though Piyathilaka et al. [28] annotated 500 images from other datasets to create a dataset for crack instance segmentation, they did not mention the criteria used to distinguish different cracks at an intersection. As can be seen from the published dataset, it simply labels the cracks around the intersection points as distinct objects, which does not reflect the actual situation. Therefore, the instance segmentation of complex cracks that cross longitudinally and horizontally has yet to be studied in depth.
Based on the above research, we can draw the following conclusions: the main difficulties in detecting bridge cracks involve both data and algorithms. The difficulty at the data level lies in the lack of calibration methods for describing crack morphology. Because complex cracks intersect longitudinally and horizontally, a simple closed graph cannot be used to describe their morphology. Therefore, a new technical standard for crack annotation is urgently needed. The difficulty at the algorithm level lies in the absence of targeted optimization for cracks in instance segmentation. At the same time, the data and the algorithm influence each other. The crack-counting standard not only affects the difficulty and accuracy of the crack segmentation dataset, but also influences the post-processing of the model output and the resulting management decisions. The lack of an optimized instance segmentation algorithm conflicts with the requirements of subsequent evaluation, especially for the detection of long, thin cracks. Breakpoints in detected cracks are one symptom of an unoptimized instance segmentation model, and they result in counting errors and incorrect dimension estimates.
To this end, this paper uses cameras and UAVs to collect high-resolution images of concrete bridge cracks. A criterion for the calibration and counting of concrete cracks under complex conditions, called relative breakpoint annotation, is proposed. The proposed criterion is used to create a high-quality instance segmentation dataset of concrete bridge cracks using Photoshop, LabelMe and EISeg. The dataset was manually calibrated with information on the true size of the crack targets. To verify the improvement in crack recognition accuracy with this dataset format, several instance-segmentation algorithms with different design architectures (Mask RCNN and Yolact) were trained and tested.
The proposed method provides a normalized crack counting rule and boundary determination rule to standardize the annotation of crack instance segmentation datasets. The dataset produced with this method contains 787 images, including 3794 crack objects, and is intended to improve the accuracy of crack instance segmentation at the dataset level. Based on the proposed dataset production method, a set of lightweight crack-recognition methods is implemented.
This paper begins by introducing two classic deep learning algorithms, Mask RCNN and Yolact, in Section 2; these are employed to validate the proposed crack annotation method. Following this, Section 3 presents the proposed crack annotation method and a comprehensive analysis in detail. The evaluation work is conducted in Section 4, followed by a section of concluding remarks.
3. Breakpoint-Based Crack Annotation
This section presents the proposed normalized crack-boundary demarcation and technical criteria, which are used to produce a standard crack instance segmentation dataset; the method focuses on concrete bridge cracks. We refer to the proposed method as relative-breakpoint crack annotation, and compare it with two methods that we previously abandoned. The proposed method standardizes the production of crack instance segmentation datasets, which improves the accuracy of crack instance segmentation models at the data level.
The counting criterion for cracks has a great influence on the model’s performance. The most difficult problem in counting and calibration is how to judge the boundaries of multiple intersecting cracks. Taking Figure 3a as an example, different crack annotation methods, as indicated by Figure 3b–d, can seriously affect the training and performance of the crack-detection model. Therefore, in the following subsections we make a comparative analysis between continuous crack annotation (Figure 3b) and absolute breakpoint annotation (Figure 3c). Based on the results of the analysis, a relative-breakpoint-based annotation strategy, as seen in Figure 3d, is proposed to delineate crack boundaries and create a high-quality crack instance segmentation dataset.
3.1. Continuous Crack Annotation
Continuous annotation is shown in Figure 3b: all connected cracks in the image are considered as a whole and marked as a single object. The advantage of this labelling method is that the boundary standard is simple, and there is no ambiguity or difficulty in applying it.
However, the disadvantages of this approach are obvious. For anchor-based instance-segmentation models (i.e., models whose final bounding rectangles are obtained from candidate windows through anchor adjustment, filtering, etc.), this approach depends largely on the model’s ability to detect and segment large objects. It therefore renders the detection and segmentation of small objects largely meaningless. The model uses different feature layers and corresponding anchors for crack detection and segmentation, with feature layers of different scales used to detect objects of different scales. As a result, only the proposals and anchors generated by the large-scale prediction feature layer are useful in back-propagation during model training. By contrast, the proposals generated by the prediction feature layers of other scales are classified as negative. Because these proposals do in fact contain crack pixels, labelling them as negative causes the weights of each network structure to be updated in the wrong direction during back-propagation.
3.2. Absolute Breakpoint Crack Annotation
Under absolute breakpoint annotation, a crack is split into separate objects at every intersection point. An example of absolute breakpoint annotation is shown in Figure 3c. Compared with the continuous crack labelling rule, this labelling criterion is conceptually inverse and more sensitive to small objects. The advantage of absolute breakpoint labelling is that the boundary is clear, and there is no situation where the boundary is blurred or difficult to determine.
Similar to continuous annotation, absolute breakpoint annotation has disadvantages, including a potential reduction in the model’s performance in detecting and segmenting large objects. However, according to our experience and analysis, its disadvantages may not be as obvious as those of continuous labelling. Continuous labelling increases the size of the proposal rectangle, but the increase in the mask size is much smaller than the increase in the proposal size. Since the width of a crack is very small relative to its length, the crack can be approximated as a curve. Compared to absolute breakpoint annotation, the area of the proposal rectangles for continuous annotation grows at a quadratic rate, while the area of the masks grows at an approximately linear rate. Therefore, for absolute breakpoint annotation, the effective pixel ratio of cracks within the rectangles is expected to be higher than that of continuous annotation, which facilitates segmentation.
For most instance-segmentation algorithms, segmentation is performed based on the detection results. Generally speaking, the algorithm first completes the task of target detection, and then performs segmentation within the detected target range (i.e., bbox rectangle). Both segmentation and object detection backpropagation will change the weight parameters of each structure of the whole model. Therefore, it is necessary to ensure that the proportion of effective pixels inside the rectangular frame is as high as possible.
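The scaling argument above can be illustrated with a small numerical sketch. It assumes an idealized crack modelled as a straight rotated rectangle of length `length` and width `width`; this model, the function name `fill_ratio` and the sample dimensions are assumptions made for illustration and do not come from the dataset. For a diagonal crack, the bounding-box area grows roughly as the square of the length, while the mask area grows linearly, so the effective pixel ratio falls as the annotated object gets longer.

```python
import math


def fill_ratio(length: float, width: float, angle_deg: float) -> float:
    """Fraction of the axis-aligned bounding box covered by the crack mask.

    The crack is idealized as a straight rectangle (length x width) rotated
    by angle_deg; real cracks are curved, so this is only a rough model.
    """
    t = math.radians(angle_deg)
    # Width and height of the axis-aligned box enclosing the rotated rectangle.
    box_w = length * abs(math.cos(t)) + width * abs(math.sin(t))
    box_h = length * abs(math.sin(t)) + width * abs(math.cos(t))
    mask_area = length * width
    return mask_area / (box_w * box_h)


# Hypothetical crack: 2 px wide, inclined at 45 degrees, with growing length.
for length in (100, 200, 400):
    print(length, round(fill_ratio(length, 2, 45), 4))
```

Under this model, quadrupling the length of a diagonal crack reduces the effective pixel ratio by almost a factor of four, which is consistent with the preference for shorter annotated objects expressed above.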
3.3. Relative Breakpoint Crack Annotation
According to our analysis, the above two methods do not conform, to a certain extent, to objective laws, but their standards are clear and unambiguous. Combining the advantages of the above two labelling standards, this paper proposes a relative breakpoint labelling method, as indicated in Figure 3d and Figure 4.
Relative breakpoint labels still distinguish the boundary of the crack object at the breakpoint, but the breakpoint here is a relative breakpoint and is specific to each crack. The rule is defined as follows:
- Firstly, consider the case in which there is an order-of-magnitude difference in crack width around the intersection point, i.e., the absolute breakpoint. For the wider crack, the absolute breakpoint is not a relative breakpoint, and the crack does not break here to form two cracks. By contrast, the narrower crack forms a relative breakpoint here and breaks at it.
- Secondly, if the crack widths around the intersection point (i.e., the absolute breakpoint) are similar and cannot be distinguished by the first criterion, the crack is judged by whether its line is continuous. For cracks with continuous lines, absolute breakpoints do not constitute relative breakpoints. However, for cracks with discontinuous lines or sharp changes before and after the absolute breakpoint, a relative breakpoint is formed. In short, cracks with discontinuous lines or sharp changes break at relative breakpoints.
According to the above two rules, we can judge most cracks that cross each other. For example, the red crack shown in Figure 4 is much narrower than the blue crack, so a relative breakpoint is formed at the intersection point, and the red crack breaks at this relative breakpoint. The blue crack is much wider, so no relative breakpoint is formed for it at the intersection. At the same time, this case also satisfies the second criterion: the line of the blue crack at the intersection is more continuous than the line of the red crack.
As a result, the red crack should also be broken at the intersection, while the blue crack is kept in a continuous line.
Theoretically, the continuity of a line can be characterized by the curvature of the line near the breakpoint. The greater the curvature, the greater the probability that the two segments before and after the breakpoint belong to different cracks. The advantage of relative breakpoint annotation is that the distinction between cracks is more in line with mechanics and human logic. In addition, relative breakpoint annotation makes objects of different scales in the dataset more balanced. Therefore, the prediction feature layers of each scale in the model, and the proposals and anchors they generate, can be fully utilized to enhance the robustness of the model.
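The two rules can be sketched as a simple decision function. This is only an illustration of the logic, not the annotation procedure actually used, which is carried out manually. The class `CrackAtJunction`, its fields, the width ratio of 10 (standing in for "order of magnitude") and the 30-degree turning threshold (standing in for a "sharp change") are all hypothetical choices.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class CrackAtJunction:
    name: str
    width: float          # mean width near the junction, any consistent unit
    turning_angle: float  # direction change across the junction, in degrees


def breaks_at_junction(a: CrackAtJunction, b: CrackAtJunction,
                       width_ratio: float = 10.0,
                       max_turn_deg: float = 30.0) -> Dict[str, bool]:
    """Return, per crack, whether it gets a relative breakpoint at the junction."""
    wide, narrow = (a, b) if a.width >= b.width else (b, a)
    if wide.width >= width_ratio * narrow.width:
        # Rule 1: clear width difference; only the narrower crack breaks.
        return {wide.name: False, narrow.name: True}
    # Rule 2: similar widths; a crack breaks if its line changes sharply.
    return {c.name: c.turning_angle > max_turn_deg for c in (a, b)}


# Hypothetical values loosely mirroring the blue/red example above.
blue = CrackAtJunction("blue", width=3.0, turning_angle=5.0)
red = CrackAtJunction("red", width=0.2, turning_angle=40.0)
print(breaks_at_junction(blue, red))
```

In practice, the thresholds would need to be chosen by the annotator or estimated from the data; the sketch only shows that the first rule takes precedence and the second is applied when widths are comparable.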
3.4. Annotation Method Comparative Analysis
Based on the previous analysis, we summarize the features of the three annotation methods in Table 1, including the human subjective effect, mechanical rationality and model training efficiency. Since sample labelling is a human-interactive task, the primary judgement is based on the annotator’s subjective visual perception. Therefore, an indicator named “human subjective intervention” is proposed to assess the interference of human judgement in the process of creating samples for model training. Secondly, cracks are the manifestation of structural components under various loading conditions; hence, the labeled crack objects should be mechanically rational. For instance, two cracks that intersect should not be regarded as one crack, because this does not conform to mechanical analysis. Therefore, the second index proposed and compared is the mechanical rationality of the three annotation methods. Thirdly, computation cost is taken into consideration by judging whether the annotation methods are conducive to model training, which is presented as the model training efficiency.
In addition, since the proposed method focuses on cracks in concrete bridge beams encountered in actual engineering inspection, the analysis of the three methods does not consider factors outside the actual inspection process, and the qualitative analysis is carried out on normally collected images.
The comparative results are summarized in Table 1, following our previous discussion of crack features and deep learning model structures. Generally, continuous crack annotation should be the most objective method, with the least human intervention; however, a high computation cost is expected when training the model on continuously annotated crack samples. Relative breakpoint annotation is subjectively influenced by the annotator, but its results are consistent with both human subjective perception and mechanical analysis. It also provides relatively balanced feature maps of different scales, which should be conducive to the convergence of model training. To verify these conjectures, an experiment comparing the model-training performance of the three annotation methods is described in Section 4.
4. Lightweight Crack Detection Model
As mentioned previously, this section first verifies the performance of the proposed relative-breakpoint-based crack annotation method. A dataset of 120 images containing 744 cracks is annotated using each of the three methods, i.e., continuous crack annotation, absolute-breakpoint crack annotation and relative-breakpoint crack annotation. Mask RCNN is used for model training, and the bbox mAP is used as the indicator to select the best annotation method. The mask mAP is then used as a more precise indicator to assess the model once the method has been chosen. After that, the best-performing crack annotation method is used to prepare a larger dataset of 787 images for training and evaluating the crack detection model, followed by post-processing procedures that generate the dimensional information of the detected cracks. The 787 images in the dataset contain 3794 crack objects, so each image contains multiple cracks and is highly complex, as the example shown in Figure 5 demonstrates. Thus, the dataset has sufficient complexity for validating the proposed method. Mask RCNN and Yolact are employed to further validate the accuracy of our method. These two models were selected because they represent the two-stage and one-stage instance segmentation approaches, respectively. Validating the proposed method with both models demonstrates its versatility. An overview of the procedure can be found in Figure 5.
4.1. Crack-Annotation Method Evaluation
First, a small dataset containing 120 images, referred to as the trial set, is prepared to assess the three crack annotation methods, as seen in Figure 6.
Mask RCNN was used for crack detection. The training tests were conducted on Ubuntu 20.04 with an Intel(R) Xeon(R) W-2133 CPU @ 3.60 GHz, using Python 3.7. An NVIDIA GeForce RTX 2080 Ti GPU was used as the accelerator for the model training tests. Model training and testing were accelerated using CUDA and cuDNN. The deep learning framework was PyTorch.
Precision and recall are the basic indicators used to evaluate the models’ performance. They are calculated from the numbers of true positives (TP), false positives (FP) and false negatives (FN), as given in Equations (1) and (2):

$$\mathrm{Precision} = \frac{TP}{TP + FP} \tag{1}$$

$$\mathrm{Recall} = \frac{TP}{TP + FN} \tag{2}$$

where TP is the number of detected objects that are correctly classified, and FP is the number of background regions in the image that are misidentified as objects. Additionally, FN is the number of objects in the image that are misclassified as background, and TN (true negatives) is the number of background regions in the image that are correctly classified as background.
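As a minimal sketch of these two indicators, the function below computes precision and recall from TP, FP and FN counts. The function name, the handling of a zero denominator and the sample counts are assumptions made for illustration and are not taken from the experiments.

```python
from typing import Tuple


def precision_recall(tp: int, fp: int, fn: int) -> Tuple[float, float]:
    """Return (precision, recall) computed from detection counts.

    A zero denominator is mapped to 0.0; this convention is an assumption
    of the sketch, not something specified in the text.
    """
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    return precision, recall


# Hypothetical counts: 8 correct detections, 2 false alarms, 4 missed cracks.
p, r = precision_recall(tp=8, fp=2, fn=4)
print(f"precision={p:.3f}, recall={r:.3f}")
```

Note that TN does not enter either formula, which is convenient in detection tasks where the number of background regions is not well defined.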
Average precision (AP) summarizes the precision of an object detection algorithm over the range of recall values. To assess the model’s performance, mean average precision (mAP), i.e., the mean of the AP over the different categories, is also taken into consideration. Similarly, average recall (AR) is a performance metric in object detection that measures the ratio of correctly detected objects to the total number of objects, averaged across multiple IOU thresholds. In general, AR, AP and mAP are performance metrics that can be used to evaluate the accuracy of object detection algorithms.
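To make the AP definition more concrete, the sketch below computes AP for one category as the area under the precision-recall curve of a ranked list of detections. It uses all-point interpolation with a monotone precision envelope; the evaluation tool used for the reported results may use a different interpolation (for example, fixed recall sampling points), so this choice and the function name are assumptions.

```python
from typing import List, Tuple


def average_precision(detections: List[Tuple[float, bool]], num_gt: int) -> float:
    """AP for one category.

    detections: (confidence, is_true_positive) pairs at a fixed IOU threshold.
    num_gt: number of ground-truth objects of this category.
    """
    if num_gt == 0:
        return 0.0
    ranked = sorted(detections, key=lambda d: d[0], reverse=True)
    tp = fp = 0
    recalls, precisions = [], []
    for _, is_tp in ranked:
        if is_tp:
            tp += 1
        else:
            fp += 1
        recalls.append(tp / num_gt)
        precisions.append(tp / (tp + fp))
    # Make precision non-increasing in recall (the usual envelope step).
    for i in range(len(precisions) - 2, -1, -1):
        precisions[i] = max(precisions[i], precisions[i + 1])
    # Sum rectangle areas under the enveloped precision-recall curve.
    ap, prev_recall = 0.0, 0.0
    for r, p in zip(recalls, precisions):
        ap += (r - prev_recall) * p
        prev_recall = r
    return ap


# Hypothetical ranked detections for a category with two ground-truth cracks.
print(average_precision([(0.9, True), (0.8, False), (0.7, True)], num_gt=2))
```

The mAP is then obtained by averaging such AP values over categories and, where applicable, over several IOU thresholds.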
The results are summarized in Table 2. They show that relative-breakpoint crack annotation suits the model best, as indicated by its AP and AR values. For instance, besides AP:0.5–0.95, Rel-break also has the highest value on most of the other indicators. Overall, Rel-break has an AP value of 28.4, much higher than Cont’s 12.5 and Abs-break’s 18. In addition, Rel-break has a large advantage in AP and AR for large-scale targets, with a value of 55.8. On the other hand, the AP of the Cont method for large-scale targets is very low, only 19, which indicates that slender, large-scale labels are not suitable for model training. Additionally, Abs-break is slightly better than Rel-break on the small-scale object indicators. This might be because, when intersection points occur in small-scale objects, the boundaries are difficult to distinguish at such a small scale. Thus, Rel-break does not show an advantage in distinguishing different cracks at the intersection points in this case.
Therefore, the results support the advantage of the proposed relative breakpoint annotation, in agreement with the analysis in Table 1. This method was consequently chosen to label more images in order to further examine its feasibility.
4.2. Dataset Preparation for the Lightweight Model
The larger dataset contains 787 images of cracks with different scales, surroundings, quantities and morphologies.
Instance-segmentation datasets such as COCO use polygon vertices to describe a line. The problem is that manual calibration cannot provide enough vertices to describe a line smoothly and with sufficient precision. To ensure the smoothness of the crack line, we use the lasso tool of Photoshop to annotate the cracks. For instance, as shown in Figure 7a, Photoshop is capable of capturing detailed information about the edges of the crack. The labeled information from Photoshop is saved as a binary PNG image, as shown in Figure 7b. After that, the PNG is converted into a JSON file in COCO dataset format using the digital image gradient method. The process is shown in Figure 7c,d. Since the COCO format uses polygons to approximate the contour of an object, the annotation of map cracking is difficult. As a result, map cracking is excluded from the dataset and only the non-map-cracking images are kept. The trained model is expected to characterize unseen map cracking by learning features from the multiple intersecting single cracks.
4.3. Lightweight Model Training and Validation
Image augmentation approaches, i.e., flipping, rotation, random cropping, Gaussian blur, etc., are applied to the training dataset. The weights of the models pre-trained on the COCO [35] dataset are transferred to our task. With these pre-trained weights, the backbone network only needs fine-tuning, which greatly reduces the amount of data required to train the rest of the model. In Mask RCNN or Yolact, anchors are predefined bounding boxes of different scales and aspect ratios that are placed at regular intervals on the image to enable the network to detect objects of varying sizes and shapes. Based on the statistics of the dataset, setting the anchor sizes to 50, 100, 200, 400 and 800 covers most of the annotated instances. The length-to-width ratios of the anchors are set to 1/4, 1 and 4. In order to emphasize the importance of crack alignment, the proportion of the five components among the original loss function is defined as
.
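The anchor configuration described above can be enumerated with a short sketch. It assumes the common convention that an anchor of size s and aspect ratio r (height over width) has height s·√r and width s/√r, so that its area stays s²; the function name and this convention are assumptions, and the exact definition depends on the framework implementation.

```python
import math
from typing import List, Sequence, Tuple


def anchor_shapes(sizes: Sequence[float],
                  ratios: Sequence[float]) -> List[Tuple[float, float]]:
    """Return (width, height) for every size/ratio pair, keeping area = size**2."""
    shapes = []
    for s in sizes:
        for r in ratios:
            h = s * math.sqrt(r)  # ratio is interpreted as height / width
            w = s / math.sqrt(r)
            shapes.append((w, h))
    return shapes


# Sizes and ratios taken from the text; the output layout is illustrative.
shapes = anchor_shapes(sizes=(50, 100, 200, 400, 800), ratios=(0.25, 1.0, 4.0))
for w, h in shapes:
    print(f"{w:7.1f} x {h:7.1f}")
```

With five sizes and three ratios this gives 15 anchor shapes, ranging from elongated horizontal boxes to elongated vertical boxes, which suits long, thin cracks in either orientation.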
As given in Table 3, we first evaluated the performance of the model at intersection over union (IOU) thresholds between 0.5 and 0.95, at intervals of 0.05. Results obtained with a 0.5 threshold are recorded as AP50. Similarly, AP75 represents the average precision when the IOU threshold is 0.75. IOU is a metric used to evaluate the performance of object detection and image segmentation tasks [36]. It is calculated by dividing the area of intersection between the predicted and ground truth regions by the area of their union. The formula for IOU is as follows:

$$\mathrm{IOU} = \frac{|A \cap B|}{|A \cup B|}$$

where $A$ is the predicted region and $B$ is the ground truth region.
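The IOU computation can be sketched for both bounding boxes and binary masks. The box format `(x1, y1, x2, y2)`, the nested-list mask representation and the function names are assumptions chosen to keep the example dependency-free; the list of thresholds mirrors the 0.5 to 0.95 range with a step of 0.05 mentioned above.

```python
from typing import List, Sequence


def box_iou(a: Sequence[float], b: Sequence[float]) -> float:
    """IOU of two boxes given as (x1, y1, x2, y2) with x1 < x2 and y1 < y2."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def mask_iou(m1: List[List[int]], m2: List[List[int]]) -> float:
    """IOU of two equally sized binary masks stored as nested lists of 0/1."""
    inter = union = 0
    for row1, row2 in zip(m1, m2):
        for p, q in zip(row1, row2):
            inter += 1 if (p and q) else 0
            union += 1 if (p or q) else 0
    return inter / union if union else 0.0


# Thresholds from 0.5 to 0.95 in steps of 0.05.
IOU_THRESHOLDS = [round(0.5 + 0.05 * i, 2) for i in range(10)]

print(box_iou((0, 0, 10, 10), (5, 5, 15, 15)))
print(mask_iou([[1, 1, 0], [0, 0, 0]], [[0, 1, 1], [0, 0, 0]]))
```

A prediction counts as a true positive at a given threshold when its IOU with a ground truth object is at least that threshold, which is why AP50 is more lenient than AP75.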
It can be seen from Table 3 that the dataset annotated with relative breakpoints allows both Mask RCNN and Yolact to converge and meet the requirements of industry. Several detection results are shown in Figure 8, which indicate that the model is able to distinguish intersecting cracks well and count them effectively. Taking Figure 8c as an example, crack0 intersects crack4. Since crack0 has a more continuous line shape and a larger width, the parts of crack0 before and after the intersection point are treated as a whole, while crack4 is broken at the intersection point. Additionally, the model can simultaneously satisfy the detection requirements of cracks with different lengths, widths and aspect ratios. For instance, the cracks shown in Figure 8c,h have complex intersections and contain different crack scales. In this case, the model still achieves good localization and segmentation. Moreover, Figure 8e shows that the model can perform well under weak lighting conditions.
Therefore, the results indicate that, when using the proposed relative breakpoint labelling method, only around 700 samples are needed to train a model with satisfactory performance, whether it is Mask RCNN or Yolact.