1. Introduction
Highways are key infrastructure, serving as hubs for transportation and playing an irreplaceable role in people’s daily lives. However, as investment in highways and other infrastructure in China increases year by year, many highways in service continue to deteriorate and become damaged, requiring regular monitoring and assessment of their condition. Traditional bridge crack detection relies primarily on manual measurement, which suffers from low efficiency, a high missed-detection rate, long inspection times, and high cost. In addition, quantities such as crack width and length are time-consuming to calculate and process. Automatic and effective crack detection is therefore crucial for assessing the structural health of a bridge.
Early methods of detecting and maintaining the condition of the road surface generally relied on manual inspection, which was not only labor- and time-intensive, but also had a low detection accuracy and some dangers [
1,
2,
3]. Scholars around the world have conducted a series of extensive and in-depth research using the latest scientific and technological developments to accurately and effectively extract crack information from images [
4,
5,
6]. In 2014, Wang et al. [
7] proposed a road surface crack extraction method based on valley bottom boundary; it generates findings for crack detection by applying a number of image processing methods. A crack connection technique for road surfaces was put forth by Liang et al. [
8] in 2015; it is based on Prim’s minimum spanning tree and fills gaps between crack fragments to recover the crack structure. These conventional crack-detection techniques have clear drawbacks: each is designed for a particular dataset or scenario, and if the dataset or scenario changes, the crack detector fails.
The threshold segmentation algorithm is one of the most basic methods in crack image segmentation, with the characteristics of less computation, simple operation, and robust performance. This algorithm can segment an image into black and white colors by extracting its grayscale value information. However, when dealing with images with weak contrast, threshold segmentation usually requires contrast enhancement first. In 1992, Kirschke et al. [
9] proposed a road surface crack image segmentation algorithm based on the histogram, but it is only suitable for clear crack recognition. Subsequently, Oh et al. [
10] proposed an iterative threshold segmentation algorithm, but it requires the manual setting of thresholds. Therefore, the threshold segmentation algorithm is suitable for road surface crack images with consistent background texture, uniform illumination, and high contrast.
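As an illustration of histogram-based threshold segmentation, here is a minimal Otsu-style threshold chooser in NumPy. This is the generic technique, not any of the cited algorithms; the synthetic image below is purely illustrative:

```python
import numpy as np

def otsu_threshold(gray):
    """Pick the threshold maximizing between-class variance of a grayscale image.

    gray: 2-D uint8 array. Returns (threshold, binary mask) where the mask
    is True for pixels darker than the threshold (crack candidates are
    typically darker than the road surface).
    """
    hist = np.bincount(gray.ravel(), minlength=256).astype(float)
    prob = hist / hist.sum()
    best_t, best_var = 0, -1.0
    for t in range(1, 256):
        w0, w1 = prob[:t].sum(), prob[t:].sum()
        if w0 == 0 or w1 == 0:
            continue  # one class empty: threshold not usable
        mu0 = (np.arange(t) * prob[:t]).sum() / w0
        mu1 = (np.arange(t, 256) * prob[t:]).sum() / w1
        var_between = w0 * w1 * (mu0 - mu1) ** 2
        if var_between > best_var:
            best_t, best_var = t, var_between
    return best_t, gray < best_t

# Synthetic image: bright background (200) with a dark crack-like line (30)
img = np.full((32, 32), 200, dtype=np.uint8)
img[10:12, :] = 30
t, mask = otsu_threshold(img)
print(t, mask.sum())  # 31 64
```

As the text notes, this kind of method only works when the background texture is consistent and contrast is high; on low-contrast images the histogram modes overlap and the chosen threshold becomes unreliable.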
Some traditional algorithms can be used to implement road crack detection, such as the improved K-Means algorithm proposed by Fang et al. [
11]. Senthan et al. [
12] proposed to use fuzzy Hough transform to detect cracks in road images, taking into account the fact that cracks are composed of nearly straight segments embedded in surfaces with considerable texture. The minimum path selection algorithm proposed by Rabin et al. [
13] realizes the automatic crack detection of two-dimensional road images. However, these algorithms require the manual setting and adjustment of parameters and depend strongly on human operation. In summary, edge-detection algorithms judge whether a pixel lies on a crack edge based on local grayscale and gradient information; they are only suitable for crack images with strong edge information and tend to misclassify background regions with strong edges as cracks. When noise is high, edge detection performs poorly.
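As a reminder of what such local gradient responses look like, here is a minimal Sobel gradient magnitude in NumPy (the generic edge-detection primitive, not any cited method):

```python
import numpy as np

def sobel_magnitude(gray):
    """Gradient magnitude via 3x3 Sobel kernels (edge-detection primitive).

    gray: 2-D float array; returns an array of the same shape where large
    values indicate strong local gradients (candidate crack edges).
    """
    kx = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=float)
    ky = kx.T  # vertical-gradient kernel is the transpose
    p = np.pad(gray, 1, mode="edge")
    H, W = gray.shape
    gx = np.zeros((H, W))
    gy = np.zeros((H, W))
    for i in range(H):
        for j in range(W):
            win = p[i:i + 3, j:j + 3]
            gx[i, j] = (win * kx).sum()
            gy[i, j] = (win * ky).sum()
    return np.hypot(gx, gy)

# A vertical dark line on a bright background produces strong responses
# on either side of the line and zero response in flat regions
img = np.full((8, 8), 200.0)
img[:, 4] = 30.0
mag = sobel_magnitude(img)
print(mag[4, 0], mag[4, 3] > 0)  # 0.0 True
```

This also illustrates the failure mode described above: any background structure with strong gradients (lane markings, shadows, texture) produces exactly the same kind of response as a crack edge.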
Deep-learning techniques have been extensively applied to the identification and segmentation of road surface cracks in recent years [
14,
15,
16]. By merging deep-learning techniques with road surface crack detection technology [
17,
18,
19], this technology has significantly increased the efficiency and accuracy of road surface crack detection. However, due to the characteristics of high similarity between road surface cracks and background and small and irregular shapes of cracks in reality, accurate identification has always been a challenging problem. Improving the accuracy and timeliness of image crack extraction has become the focus of current research. Early CNN-based crack detection approaches essentially consist of two tasks: finding the detection target’s bounding box and determining its category. These algorithms only identify the smallest bounding rectangle of the target crack in the image. Currently, window-based and region-based object detection methods are the two most often used approaches for finding objects. Window-based detection techniques use a fixed-size window to scan the image and a classifier to categorize each sliding window section. Cha et al. [
20] used a window-based neural network to detect road surface cracks. The results demonstrated that this method could gather crack information more precisely than conventional edge-detection methods, although its detection accuracy was influenced by the window width and length, which are difficult to choose. If the window is too large, it contains an excessive amount of irrelevant information, which hampers detection accuracy; if the sliding window is too small, the window region does not contain enough crack information to establish whether a crack is present, which also reduces detection accuracy.
Methods for detecting objects based on regions create candidate regions that initially employ region proposal approaches, establish regions of interest, and then carry out feature extraction. Due to its superior detection effect, Faster Region CNN (Faster R-CNN) has been utilized numerous times for fracture identification on road surfaces [
21,
22]. However, Maeda et al. [
23] claim that they are only able to identify road fractures and cannot learn the maximum width, length, or area of a geometric feature. Attard et al. [
24,
25] detected road damage using recent object detection networks, Inception V2 and MobileNet. Mask R-CNN has also been used to detect road surface cracks with satisfactory results. However, the above methods have limited detection accuracy and cannot achieve pixel-level detection due to irregular crack shapes.
Region-based detection can only locate cracks and cannot obtain their geometric dimensions; crack segmentation is therefore necessary to achieve this goal. Semantic segmentation, which classifies each pixel as belonging to a crack or to the background, is one of the techniques being studied to increase the accuracy of road surface crack detection. Zhang, L. et al. [
26] used an improved CNN method that obtained better results compared to other traditional methods. Zhang et al. [
27] adopted another efficient structure called CrackNet which has a strong anti-interference ability and can maintain stable detection results, having strong adaptability and robustness. However, these two methods cannot segment small cracks well because the traditional CNN has limitations in fine image segmentation; thus, a fully convolutional network [
28,
29] gradually began to be applied to road surface crack detection. In contrast to conventional CNNs, fully convolutional networks (FCNs) accept inputs of any size and use deconvolution to upsample the final feature map back to the size of the original input image, enabling a prediction for each individual pixel.
As researchers strive for improved detection accuracy, more and more encoder–decoder frameworks are being applied to CNNs. The encoder is a classification network used to extract input features, while the decoder is a network that gradually restores the feature information. Bang et al. [
30] used a new encoder–decoder network with more layers and a deeper structure for the pixel-level identification of urban road surface cracks; it is capable of identifying flaws in black-box camera images. For the quantification and detection of cracks, Ji et al. [
31] utilized an integrated approach based on DeepLabV3+. Two-step convolutional networks were employed by Chun et al. [
32] and J. Liu et al. [
33] to first recognize and locate cracks before segmenting them. Among the numerous encoder–decoder networks, the best known is U-Net, proposed by Ronneberger et al. [
34]. Its outstanding performance on medical images drew the attention of numerous scholars. Although medical cells and road surface cracks differ in size and shape, researchers in civil engineering adapted U-Net to detect structural cracks in road surfaces. While the U-Net network performs well in the field of crack detection, considering that future crack detection will be fully automated in real time, its further application is still hindered by rapidly growing data volumes, high computational cost, a long training process, etc. [
35,
36]. The introduction of the SOLO model brings new ideas and methods to the field of instance segmentation [
37]. The SOLO model adopts a new segmentation strategy that segments targets more accurately and yields high-quality results, using a center-point-based scheme that predicts and segments instances in images quickly and efficiently. However, because SOLO is an instance location-based segmentation algorithm, it has limitations for objects of different sizes; for smaller objects in particular, it may not capture enough location information, resulting in poor segmentation of small targets. The SOLO model was subsequently upgraded to the SOLOv2 model [
38], introducing some new technologies to improve the model’s segmentation performance and efficiency, such as using a distributed head network, mask feature pyramid, etc. However, SOLOv2 still has some limitations.
SparseInst is a real-time instance segmentation framework based on sparse instance activation maps [
39]. Compared with traditional image-based or deep-learning-based methods, SparseInst has the following advantages: its sparse feature-based design significantly reduces computation and storage without losing information, improving efficiency and robustness, and relative to image-based methods it achieves faster speed and higher accuracy in crack detection tasks. However, its segmentation of small cracks remains limited, and under conditions such as low resolution, uneven illumination, and strong noise interference, its detection performance degrades considerably.
Therefore, in view of the problems of low detection accuracy and susceptibility to background interference in related work, this paper improves the original SparseInst network in Section 2 by adding the CBAM module [
40], DCNv2 convolution [
41] and introducing SPM stripe pooling structure and MPM hybrid pooling structure [
42] to adaptively highlight object information areas, improve detection accuracy, and achieve the real-time, accurate detection of cracks. In Section 3, the SparseInst-CDSM algorithm is used to extract clear crack images; the central-axis skeleton of each crack is extracted via the central-axis method, its pixel length and width are computed, and the pixel size is then converted into actual physical size according to current standards to judge whether the crack needs maintenance, reducing labor costs and greatly improving work efficiency. Section 4 describes the experimental environment and results, verifying the feasibility of the algorithm proposed in this paper.
2. Methodology
The algorithm in this paper is based on the SparseInst network and improves the framework to achieve crack morphology segmentation and extraction. As shown in
Figure 1, the model mainly includes three components: the backbone, the encoder, and the IAM-based decoder. Given an input image, the backbone extracts multi-scale image features (i.e., C3, C4, and C5).
The original SparseInst encoder uses a pyramid pooling module (PPM), and this paper replaces PPM with SPM to avoid establishing unnecessary connections between distant positions. In
Figure 1, ‘4×’ or ‘2×’ indicates upsampling by a factor of 4 or 2. The IAM-based decoder consists of two branches, namely the instance branch and the mask branch. In the instance branch, the IAM module predicts instance activation maps (as shown in the right column) to obtain instance features {z_i}, i = 1, …, N, for recognition and for predicting mask kernels. The mask branch provides mask features M, which are multiplied with the predicted kernels to generate the segmentation masks.
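The decoder’s final step described above — multiplying the predicted instance kernels with the shared mask features — can be sketched in NumPy; the shapes below are illustrative assumptions, not the exact dimensions used by SparseInst:

```python
import numpy as np

def generate_masks(kernels, mask_features):
    """Multiply N predicted instance kernels with shared mask features.

    kernels:        (N, D) - one D-dim kernel per instance (instance branch)
    mask_features:  (D, H, W) - shared mask features (mask branch)
    returns:        (N, H, W) - one soft segmentation mask per instance
    """
    D, H, W = mask_features.shape
    # Flatten spatial dims; each kernel acts as a 1x1 convolution (dot product)
    logits = kernels @ mask_features.reshape(D, H * W)      # (N, H*W)
    masks = 1.0 / (1.0 + np.exp(-logits))                   # sigmoid -> soft masks
    return masks.reshape(-1, H, W)

# Example: 3 instances, 8-dim kernels, 4x4 mask features
rng = np.random.default_rng(0)
masks = generate_masks(rng.normal(size=(3, 8)), rng.normal(size=(8, 4, 4)))
print(masks.shape)  # (3, 4, 4)
```

Because the kernels are predicted per image rather than per anchor box, this multiplication is the only per-instance work at inference time, which is what makes the design fast.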
2.1. Attention Mechanism Module
In order to enhance the feature extraction capability of the SparseInst backbone network, this paper adds the CBAM module, as shown in
Figure 2, without destroying the structure of the feature extraction network.
This module first applies the channel attention stage, as shown in
Figure 3a. The size of the input feature map is H × W × C, where H and W are the height and width of the feature map, respectively, and C is the number of channels. Two pooling operations, MaxPool and AvgPool, are first applied to obtain two 1 × 1 × C feature vectors, denoted M and A, respectively. Both vectors are passed through a shared network of two fully connected layers and summed, and the sigmoid function maps the result to a weight coefficient between 0 and 1. This weight coefficient is multiplied with the input feature map to reweight its channels, producing the output feature map. The channel attention module uses the following formula:
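The referenced formula does not appear in this copy of the text; the standard CBAM channel-attention expression (Woo et al.), consistent with the description above, is:

```latex
M_c(F) = \sigma\big(\mathrm{MLP}(\mathrm{AvgPool}(F)) + \mathrm{MLP}(\mathrm{MaxPool}(F))\big)
       = \sigma\big(W_1(\mathrm{ReLU}(W_0(F^{c}_{avg}))) + W_1(\mathrm{ReLU}(W_0(F^{c}_{max})))\big)
```

where σ denotes the sigmoid function and W₀, W₁ are the weights of the shared MLP.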
Here, MLP denotes the shared multilayer perceptron in the channel attention module, which first compresses the number of channels and then expands it back to the original number, with a ReLU activation between the two layers.
After the channel attention module, the spatial attention module is introduced; it concentrates on which area of the space contains the most significant features, as shown in
Figure 3b. Its input size is H × W × C. Max pooling and average pooling are applied along the channel dimension to produce two H × W × 1 feature maps, which are then concatenated in the channel dimension, giving a current feature map of size H × W × 2. This map then passes through a convolutional layer with a 7 × 7 kernel, which restores it to a single channel; the resulting feature map is H × W × 1, with H and W unchanged. The sigmoid of this map is multiplied with the input feature map to produce the final output. The following equation describes the spatial attention module:
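The referenced equation is also absent from this copy; the standard CBAM spatial-attention expression is M_s(F) = σ(f^{7×7}([AvgPool(F); MaxPool(F)])). Both attention stages can be sketched framework-free in NumPy; the learned 7 × 7 convolution is replaced by a fixed averaging filter and the MLP weights are random, assumptions made purely to keep the sketch self-contained:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def channel_attention(x, w0, w1):
    """Channel attention: sigmoid(MLP(avg-pooled) + MLP(max-pooled)).

    x:  (C, H, W) feature map
    w0: (C//r, C), w1: (C, C//r) - shared MLP weights, ReLU in between
    returns (C,) per-channel weights in (0, 1)
    """
    avg = x.mean(axis=(1, 2))                      # (C,) AvgPool
    mx = x.max(axis=(1, 2))                        # (C,) MaxPool
    mlp = lambda v: w1 @ np.maximum(w0 @ v, 0.0)   # shared two-layer MLP
    return sigmoid(mlp(avg) + mlp(mx))

def spatial_attention(x, kernel=7):
    """Spatial attention over channel-pooled maps (fixed filter stand-in)."""
    avg = x.mean(axis=0)                           # (H, W) channel-wise avg
    mx = x.max(axis=0)                             # (H, W) channel-wise max
    stacked = (avg + mx) / 2.0                     # stand-in for learned fusion
    p = kernel // 2
    padded = np.pad(stacked, p, mode="edge")
    H, W = stacked.shape
    out = np.empty_like(stacked)
    for i in range(H):
        for j in range(W):                         # 7x7 averaging "convolution"
            out[i, j] = padded[i:i + kernel, j:j + kernel].mean()
    return sigmoid(out)                            # (H, W) weights in (0, 1)

# Apply CBAM order: channel weights first, then spatial weights
rng = np.random.default_rng(1)
x = rng.normal(size=(16, 8, 8))
cw = channel_attention(x, rng.normal(size=(4, 16)), rng.normal(size=(16, 4)))
x = x * cw[:, None, None]
x = x * spatial_attention(x)[None, :, :]
print(x.shape)  # (16, 8, 8)
```

The key point the sketch shows is that both stages only rescale the input feature map, so CBAM can be inserted into the backbone without changing any tensor shapes — which is why the text can add it “without destroying the structure of the feature extraction network.”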
2.2. Deformable Convolution Module
When traditional CNN modules are used for visual recognition, their fixed geometric structure introduces defects that are difficult to avoid. Convolutional units, for instance, can only sample the input feature map at fixed locations, while pooling layers reduce spatial resolution at a fixed ratio. An ROI (Region of Interest) pooling layer likewise divides an ROI into fixed spatial units and has no inherent mechanism for dealing with geometric transformations. These defects limit the efficiency and accuracy of traditional CNN modules in handling geometric variations.
Therefore, adding deformable convolution can solve these problems to some extent because it can make different positions have different receptive field sizes and shapes to better adapt to the diversity of objects. In this way, the accuracy and robustness of convolutional neural networks can be improved so as to better handle various practical problems. This paper adds an offset to the traditional convolution operation of SparseInst, as shown in
Figure 4. It is this offset that makes the convolution deform into an irregular convolution. This offset can be a decimal and needs to be calculated using the method of bilinear interpolation.
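Because the learned offset is fractional, the feature value at the offset position is obtained by bilinear interpolation of the four surrounding integer grid points; a minimal sketch:

```python
import numpy as np

def bilinear_sample(fmap, y, x):
    """Sample a feature map at a fractional location (y, x).

    fmap: (H, W) single-channel feature map
    y, x: fractional coordinates produced by adding a learned offset
    """
    H, W = fmap.shape
    # Clamp so the 2x2 neighborhood stays inside the map
    y = min(max(y, 0.0), H - 1.0)
    x = min(max(x, 0.0), W - 1.0)
    y0, x0 = int(np.floor(y)), int(np.floor(x))
    y1, x1 = min(y0 + 1, H - 1), min(x0 + 1, W - 1)
    wy, wx = y - y0, x - x0
    # Weighted average of the four neighbors
    top = fmap[y0, x0] * (1 - wx) + fmap[y0, x1] * wx
    bot = fmap[y1, x0] * (1 - wx) + fmap[y1, x1] * wx
    return top * (1 - wy) + bot * wy

fmap = np.arange(16, dtype=float).reshape(4, 4)
print(bilinear_sample(fmap, 1.5, 2.5))  # 8.5: midpoint of rows 1-2, cols 2-3
```

Bilinear interpolation is also what makes the offsets trainable: the sampled value is a differentiable function of (y, x), so gradients can flow back into the offset-prediction branch.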
Deformable convolution DCN v1 introduces some irrelevant areas that interfere with feature extraction, reducing the performance of the algorithm; therefore, this paper adopts DCN v2. In DCN v2, in addition to the offset of each sampling point, a modulation weight is added to distinguish whether the introduced area is an area of interest; if the area covered by a sampling point is not of interest, its weight is driven toward 0. The formula is as follows:
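The referenced formula is missing from this copy; the standard DCNv2 sampling expression (Zhu et al.), matching the description above, is:

```latex
y(p) = \sum_{k=1}^{K} w_k \cdot x\left(p + p_k + \Delta p_k\right) \cdot \Delta m_k
```

where p_k enumerates the K fixed sampling offsets of the kernel, Δp_k is the learned offset, x(·) is evaluated by bilinear interpolation, and Δm_k ∈ [0, 1] is the learned modulation weight that suppresses sampling points outside the region of interest.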
2.3. Improved Context Encoder
In order to achieve faster inference speed, the original SparseInst uses single-layer prediction. Considering the limitations of single-layer features for objects of various scales, the original SparseInst reconstructs the feature pyramid network and proposes an instance context encoder, as shown in
Figure 5. In order to increase the receptive field and fuse features from P3 to P5, the instance context encoder employs the pyramid pooling module PPM [
43] after C5. This improves the output single-stage features’ multi-scale representation even further.
However, the ability of PPM to utilize contextual information is limited because only square kernel shapes are applied. In addition, PPM is only modularized above the backbone network, so it cannot be flexibly or directly applied to the network building blocks of feature learning, and PPM heavily relies on standard spatial pooling operations, which can actually lose important information features in specific scenarios. In response to the problems with the PPM feature pyramid, this paper uses the strip pooling module (SPM) instead of the PPM module, as shown in
Figure 1. To gather long-range context from several spatial dimensions, SPM employs horizontal and vertical strip pooling operations. With horizontal and vertical strip pooling layers, it becomes simple to establish long-range dependencies between discretely distributed regions and to encode band-shaped regions using the extended kernel shapes.
At the same time, because the kernel is narrow along the other dimension, it emphasizes capturing local detail. These properties distinguish the proposed SPM from conventional spatial pooling, which uses square kernels. Let the input be a two-dimensional tensor; in strip pooling, the pooling window is (H, 1) or (1, W). This is the biggest difference between strip pooling and average pooling: strip pooling averages all feature values along an entire row or column. The formula is as follows:
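The formula itself is absent from this copy; in the standard strip pooling formulation (Hou et al.), the horizontal and vertical pooled outputs are the row and column means:

```latex
y^{h}_{i} = \frac{1}{W} \sum_{0 \le j < W} x_{i,j}, \qquad
y^{v}_{j} = \frac{1}{H} \sum_{0 \le i < H} x_{i,j}
```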
Compared with the PPM module, the SPM module considers a narrow range instead of the entire feature map, avoiding unnecessary connections between distant positions, as shown in
Figure 6 below. Let x ∈ R^{C×H×W} be the input tensor, where C represents the number of channels. First, x is fed into two parallel paths, each containing a horizontal or a vertical strip pooling layer; after modulation by a one-dimensional convolutional layer with a kernel size of 3, each position aggregates its own features and those of its surroundings. Denote the two pooled branches by y^h and y^v and their fused combination by y; then the output z can be expressed as:

z = Scale(x, σ(f(y)))

where Scale(·,·) represents element-wise multiplication, σ represents the sigmoid function, and f represents a 1 × 1 convolution. In order to improve the performance and efficiency of the model, the inner products of feature vectors are combined, making the model more lightweight. In addition, SPM encodes the horizontal and vertical information of the image and balances the weights of the different parts to optimize the features. This approach greatly improves the model’s ability to collect global context information, thereby improving its performance.
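A minimal NumPy sketch of this strip-pooling gating follows; the learned 1-D and 1 × 1 convolutions are replaced by identity mappings, an assumption made purely to keep the sketch self-contained:

```python
import numpy as np

def strip_pooling_module(x):
    """Strip Pooling Module sketch: gate features by row/column context.

    x: (C, H, W) input tensor. Only the pooling / broadcast / fusion /
    sigmoid-gating structure is shown; learned convolutions are omitted.
    """
    yh = x.mean(axis=2)                 # (C, H) row means, window (1, W)
    yv = x.mean(axis=1)                 # (C, W) column means, window (H, 1)
    # Broadcast both branches back to (C, H, W) and fuse by addition
    y = yh[:, :, None] + yv[:, None, :]
    gate = 1.0 / (1.0 + np.exp(-y))     # sigmoid attention map in (0, 1)
    return x * gate                     # Scale(x, sigmoid(f(y))), f = identity

rng = np.random.default_rng(2)
x = rng.normal(size=(4, 6, 5))
z = strip_pooling_module(x)
print(z.shape)  # (4, 6, 5)
```

Note how every output position is influenced by its whole row and whole column but by nothing else — this is the “narrow range” property that avoids spurious connections between unrelated distant positions.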
2.4. Add Mix Pooling Module
In order to increase the distinctiveness of feature representations, the MPM module focuses on collecting multiple types of context information through various pooling methods. It is composed of two sub-modules: standard pooling and stripe pooling, as shown in
Figure 7. Stripe pooling makes it possible to connect regions that are discretely distributed throughout the scene and encode areas with band-like structures, as shown in
Figure 7a. However, for cases where semantic regions are closely distributed, spatial pooling is also required to capture local context information. With this in mind, as shown in
Figure 7b, a lightweight pyramid pooling sub-module is used to collect short-range dependencies. It consists of two spatial pooling layers followed by convolutional layers for extracting multi-scale features, plus a 2D convolutional layer that retains the original spatial information. The pooled feature maps in the two pooling sub-paths have sizes of 20 × 20 and 12 × 12, respectively, and all three sub-paths are then merged by summation. Together, these two sub-modules capture both short- and long-range dependencies between different positions, which is essential for scene-parsing networks.
Since MPM is modularly designed, it can be directly built on the backbone network, as shown in
Figure 1. Since the output of the backbone network has 2048 channels, a 1 × 1 convolutional layer is first attached to the backbone to reduce the number of channels from 2048 to 1024; two MPMs are then added. Each MPM uses 256 channels (i.e., a 1/4 reduction rate) for all convolutional layers with kernel sizes of 3 × 3 or 3. Finally, a convolutional layer is applied to predict the segmentation map.
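The pyramid sub-module’s pooling-and-merging structure can be sketched in NumPy; the 20 × 20 and 12 × 12 grid sizes follow the text, while the convolutional layers are omitted (an assumption for self-containment):

```python
import numpy as np

def adaptive_avg_pool(x, out_h, out_w):
    """Average-pool a (C, H, W) map to a fixed (C, out_h, out_w) grid."""
    C, H, W = x.shape
    hs = np.linspace(0, H, out_h + 1).astype(int)  # bin edges along H
    ws = np.linspace(0, W, out_w + 1).astype(int)  # bin edges along W
    out = np.empty((C, out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[:, i, j] = x[:, hs[i]:hs[i + 1], ws[j]:ws[j + 1]].mean(axis=(1, 2))
    return out

def upsample_nearest(x, H, W):
    """Nearest-neighbor upsampling of (C, h, w) back to (C, H, W)."""
    C, h, w = x.shape
    ri = np.arange(H) * h // H
    ci = np.arange(W) * w // W
    return x[:, ri][:, :, ci]

def pyramid_submodule(x, grids=(20, 12)):
    """Sum the identity path with pooled-and-upsampled paths (convs omitted)."""
    C, H, W = x.shape
    out = x.copy()                       # path retaining original resolution
    for g in grids:                      # 20x20 and 12x12 pooled paths
        out += upsample_nearest(adaptive_avg_pool(x, g, g), H, W)
    return out

rng = np.random.default_rng(3)
x = rng.normal(size=(8, 40, 40))
y = pyramid_submodule(x)
print(y.shape)  # (8, 40, 40)
```

Merging by summation (rather than concatenation) keeps the channel count constant, which is what lets MPM be stacked directly on the backbone without further channel bookkeeping.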
2.5. Preventing the Problem of Overfitting
Overfitting is a common problem in machine learning and statistical modeling: the model fits the training data too closely, reducing its generalization ability. Such a model performs well on the training dataset but focuses too much on its details and noise, ignoring the true characteristics of the data distribution and performing poorly on new, unseen data. Overfitted models are also more sensitive to outliers, noise, or errors in the input data, so even slight interference can produce unreliable predictions. To address this problem, this paper uses the following methods to avoid overfitting:
- (1)
Data Augmentation: By randomly transforming and augmenting the training data, the diversity of the training data can be increased. This can effectively reduce overfitting and improve the model’s generalization to new images. The data augmentation operations used in this paper include random cropping, rotation, scaling, and flipping.
- (2)
Regularization: Regularization is a method of limiting the complexity of the model by introducing a regularization term into the loss function. Common regularization methods include L1 regularization and L2 regularization. In this paper, regularization is used to penalize large weight values in the model, thereby avoiding overfitting.
- (3)
Early Stopping: Early stopping is a simple and effective method to prevent overfitting. It monitors the performance metrics on the validation set and stops training before the model starts to overfit. Generally, when the performance on the validation set no longer improves, it can be considered that the model has reached its best generalization ability.
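The early-stopping rule in (3) can be sketched as a small helper class; the patience and loss values below are illustrative choices, not the paper’s settings:

```python
class EarlyStopping:
    """Stop training when the validation metric stops improving.

    patience:  number of consecutive non-improving epochs tolerated
    min_delta: minimum decrease in validation loss to count as improvement
    """
    def __init__(self, patience=5, min_delta=0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss):
        """Record one epoch's validation loss; return True when training should stop."""
        if val_loss < self.best - self.min_delta:
            self.best = val_loss        # improvement: reset the counter
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1        # no improvement this epoch
        return self.bad_epochs >= self.patience

# Simulated validation-loss curve: improves, then plateaus
stopper = EarlyStopping(patience=3)
losses = [0.9, 0.7, 0.6, 0.61, 0.62, 0.60, 0.63]
stopped_at = next(i for i, l in enumerate(losses) if stopper.step(l))
print(stopped_at)  # 5
```

In practice the model weights from the best-validation epoch (here, epoch 2) are the ones kept, since later epochs have already begun to overfit.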