2.2. Methods
In the preprocessing stage, we resize all images to 224 × 224 × 3 to match the input size expected by the CNN models. We use the same size for the HC feature extraction method to maintain consistency. We then apply a median filter to smooth the images. The median filter removes noise while preserving edges, which the subsequent feature extraction relies on [21].
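As a rough illustration of why a median filter suits this step, the following sketch applies a 3 × 3 median filter to a tiny grayscale grid containing a step edge and one impulse-noise pixel. The window size, the edge-replication padding, and the toy image are assumptions made for the example; the source does not state the filter parameters.

```python
from statistics import median

def median_filter(img, k=3):
    """Apply a k x k median filter to a 2-D list of gray levels (borders replicated)."""
    h, w = len(img), len(img[0])
    r = k // 2
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            # Collect the k x k neighborhood, clamping indices at the image border.
            window = [
                img[min(max(y + dy, 0), h - 1)][min(max(x + dx, 0), w - 1)]
                for dy in range(-r, r + 1)
                for dx in range(-r, r + 1)
            ]
            out[y][x] = median(window)
    return out

# A vertical step edge (10 | 200) with one salt-noise pixel at row 2, column 1.
noisy = [
    [10, 10, 10, 200, 200, 200],
    [10, 10, 10, 200, 200, 200],
    [10, 255, 10, 200, 200, 200],
    [10, 10, 10, 200, 200, 200],
    [10, 10, 10, 200, 200, 200],
]
smoothed = median_filter(noisy)
```

In this toy case the isolated bright pixel is replaced by its neighborhood median while the step between columns 2 and 3 stays sharp, which a mean filter of the same size would blur.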
Feature extraction detects and isolates attributes in an image, such as edges, textures, shapes, and colors. These features represent the image in a more compact and informative way, which helps with image recognition. This research examines both automated (deep learning) and manual (handcrafted) feature extraction methods [22] and fuses the resulting features to distinguish crack from non-crack images.
Figure 2 presents the architecture of the feature extraction process at a glance.
To extract deep features, this research analyzes several CNN models: VGG-16, VGG-19, Inception-V3, and ResNet-50. CNN models automatically extract significant characteristics from images. For both of our datasets, ResNet-50 performs better than the other CNN models in recognizing concrete crack and non-crack surfaces. ResNet-50 is a deep CNN with 50 layers and is part of the ResNet family introduced by [23]. It addresses the challenge of training deep networks by using residual learning. This method employs shortcut connections that bypass one or more layers, thus mitigating the vanishing gradient problem. A group of layers together with the shortcut connection around it is called a residual block.
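The shortcut idea can be sketched in a few lines. The example uses small dense layers in place of convolutions and omits batch normalization, so it is a simplified illustration of y = ReLU(F(x) + x) rather than the ResNet-50 block itself; the function names and weights are hypothetical.

```python
def relu(v):
    return [max(0.0, a) for a in v]

def linear(weights, v):
    """Multiply a square weight matrix (list of rows) by the vector v."""
    return [sum(w * a for w, a in zip(row, v)) for row in weights]

def residual_block(x, w1, w2):
    """y = ReLU(F(x) + x), with F(x) = W2 * ReLU(W1 * x) standing in for the conv layers."""
    fx = linear(w2, relu(linear(w1, x)))
    # The shortcut adds the unchanged input to the output of the stacked layers.
    return relu([f + a for f, a in zip(fx, x)])

x = [1.0, 2.0, 3.0]
zeros = [[0.0] * 3 for _ in range(3)]
# With all-zero weights F(x) vanishes, yet the input still passes through the shortcut.
identity_out = residual_block(x, zeros, zeros)
```

Because the input reaches the output even when the stacked layers contribute nothing, gradients also have a direct path backward, which is the property the text attributes to residual learning.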
Figure 3 shows the structure of a residual block. The architecture of ResNet-50 includes an initial 7 × 7 convolutional layer, followed by four stages of residual blocks. Each block contains three layers of 1 × 1, 3 × 3, and 1 × 1 convolutions. These stages gradually increase the number of filters, from 64 to 2048. They include batch normalization and ReLU activation functions. The network ends with a global average pooling layer and a fully connected layer, producing output through a softmax function. Since this research uses ResNet-50 as a feature extractor, we consider the output from the layer just before the final fully connected layer to obtain features. Specifically, this is the output of the global average pooling layer, which follows the fourth stage of residual blocks. This global average pooling layer condenses the spatial dimensions of the feature maps into a single 2048-dimensional feature vector for each input image.
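A minimal sketch of the pooling step follows. It assumes the fourth stage outputs 2048 feature maps of 7 × 7 activations, which is the standard ResNet-50 shape for a 224 × 224 input; the activation values are placeholders.

```python
def global_average_pool(feature_maps):
    """Collapse each H x W map to its mean, giving one value per channel."""
    return [
        sum(sum(row) for row in fmap) / (len(fmap) * len(fmap[0]))
        for fmap in feature_maps
    ]

# Placeholder stage-4 output: channel c is filled with the constant value c.
channels, height, width = 2048, 7, 7
feature_maps = [[[float(c)] * width for _ in range(height)] for c in range(channels)]

features = global_average_pool(feature_maps)  # one feature per channel
```

The resulting list has one entry per channel, which is how the spatial maps reduce to a 2048-dimensional feature vector per image.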
The manual feature extraction approach is the HC method. To extract HC features, this research analyzes three techniques: wavelet transform [24], contourlet transform [25], and curvelet transform [26]. Each technique is applied to the concrete surface images, and features are then extracted from the transformed images using a gray-level co-occurrence matrix (GLCM) [27]. For both of our datasets, the curvelet transform performs better than the other HC methods in recognizing concrete crack and non-crack surfaces. The curvelet transform is a multi-scale geometric transform, and it is preferable to similar techniques because it represents edges, curves, and directionality effectively. This research employs the wrapping-based fast discrete curvelet transform because of its computational efficiency. For an image f[x, y] with height M and width N, if φ[x, y] is the curvelet function and K1 and K2 are the spatial locations of curvelets, then the general expression for the collection of curvelet coefficients is:
\[
C(j, \theta, K_1, K_2) = \sum_{0 \le x < M} \; \sum_{0 \le y < N} f[x, y] \, \varphi_{j, \theta, K_1, K_2}[x, y]
\]
Here, j is the scale and θ is the orientation. For image f[x, y], there are j × θ sub-band images (number of scales × number of orientations) [28]. In our method, we use a three-scale curvelet transform with four orientations: 0°, 45°, 90°, and 135°. This gives a total of 3 × 4 = 12 sub-band images. A GLCM is computed for each of these sub-bands, and 13 features, such as contrast, correlation, and entropy [27], are then calculated from each GLCM.
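To make the GLCM step concrete, the sketch below builds a normalized, symmetric co-occurrence matrix for a horizontal offset and derives the three features named above. The toy 4-level image stands in for a quantized curvelet sub-band, and the offset, quantization, and symmetric counting are assumptions; the source does not specify them, and the remaining ten features are not shown.

```python
from math import log2

def glcm(img, levels, dx=1, dy=0):
    """Normalized, symmetric gray-level co-occurrence matrix for offset (dx, dy)."""
    counts = [[0] * levels for _ in range(levels)]
    h, w = len(img), len(img[0])
    for y in range(h):
        for x in range(w):
            y2, x2 = y + dy, x + dx
            if 0 <= y2 < h and 0 <= x2 < w:
                a, b = img[y][x], img[y2][x2]
                counts[a][b] += 1
                counts[b][a] += 1  # count each pair in both directions
    total = sum(map(sum, counts))
    return [[c / total for c in row] for row in counts]

def contrast(p):
    n = range(len(p))
    return sum(p[i][j] * (i - j) ** 2 for i in n for j in n)

def correlation(p):
    n = range(len(p))
    mu_i = sum(i * p[i][j] for i in n for j in n)
    mu_j = sum(j * p[i][j] for i in n for j in n)
    var_i = sum((i - mu_i) ** 2 * p[i][j] for i in n for j in n)
    var_j = sum((j - mu_j) ** 2 * p[i][j] for i in n for j in n)
    cov = sum((i - mu_i) * (j - mu_j) * p[i][j] for i in n for j in n)
    return cov / (var_i * var_j) ** 0.5

def entropy(p):
    return -sum(v * log2(v) for row in p for v in row if v > 0)

# Toy stand-in for one sub-band, already quantized to gray levels 0..3.
sub_band = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 2, 2, 2],
    [2, 2, 3, 3],
]
p = glcm(sub_band, levels=4)
features = {
    "contrast": contrast(p),
    "correlation": correlation(p),
    "entropy": entropy(p),
}
```

Repeating this for every sub-band and concatenating the per-sub-band features gives the handcrafted part of the feature vector.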
Based on the outcomes, we select the best-performing CNN and HC techniques to form the fused model. We merge features from ResNet-50 and the curvelet transform to create the final feature vector. This results in a total of 2204 features for any concrete surface image: 2048 come from the ResNet-50 model, and the remaining 156 (12 sub-bands × 13 GLCM features) come from the curvelet transform.
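The feature count can be checked with a short sketch. It assumes that fusion is a plain concatenation of the two vectors, which the stated total suggests but the source does not spell out; the zero-filled vectors are placeholders.

```python
def fuse(deep_features, handcrafted_features):
    """Concatenate the deep and handcrafted feature vectors into one fused vector."""
    return list(deep_features) + list(handcrafted_features)

deep = [0.0] * 2048                        # placeholder for the ResNet-50 pooled output
n_sub_bands = 3 * 4                        # 3 scales x 4 orientations
handcrafted = [0.0] * (n_sub_bands * 13)   # 13 GLCM features per sub-band
fused = fuse(deep, handcrafted)
```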
Feature optimization enhances model performance by reducing feature redundancy and noise, thus improving accuracy and decreasing computational complexity. This process ensures the model focuses on the most relevant and informative features, leading to better predictions and efficiency. We apply two popular feature optimization techniques, principal component analysis (PCA) and linear discriminant analysis (LDA) [29], to the fused features. Both techniques enhance performance, with LDA giving the better outcome of the two. LDA reduces dimensionality by finding new axes that maximize the separation between classes. It projects the data onto a lower-dimensional space while maintaining class separability. LDA computes the mean vector of each class and the overall mean, then calculates the within-class and between-class scatter matrices. It solves an eigenvalue problem to find the linear combinations of features that best separate the classes. The resulting components are ordered by their ability to discriminate between classes, and the top components are used to reduce the dimensionality.
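For two classes, the eigenvalue problem described above has a closed-form solution, w = Sw⁻¹(m1 − m0), where Sw is the within-class scatter matrix and m0, m1 are the class means. The sketch below applies it to hypothetical two-dimensional points; the data and function names are illustrative, and the fused features in this study have far more dimensions.

```python
def mean(points):
    n = len(points)
    return [sum(p[d] for p in points) / n for d in range(len(points[0]))]

def scatter(points, mu):
    """2 x 2 scatter matrix: sum of outer products of the centered points."""
    s = [[0.0, 0.0], [0.0, 0.0]]
    for p in points:
        d = [p[0] - mu[0], p[1] - mu[1]]
        for i in range(2):
            for j in range(2):
                s[i][j] += d[i] * d[j]
    return s

def fisher_direction(class0, class1):
    """Two-class LDA axis w = Sw^-1 (m1 - m0) for 2-D features."""
    m0, m1 = mean(class0), mean(class1)
    s0, s1 = scatter(class0, m0), scatter(class1, m1)
    sw = [[s0[i][j] + s1[i][j] for j in range(2)] for i in range(2)]  # within-class scatter
    det = sw[0][0] * sw[1][1] - sw[0][1] * sw[1][0]
    inv = [[sw[1][1] / det, -sw[0][1] / det],
           [-sw[1][0] / det, sw[0][0] / det]]
    diff = [m1[0] - m0[0], m1[1] - m0[1]]
    return [inv[0][0] * diff[0] + inv[0][1] * diff[1],
            inv[1][0] * diff[0] + inv[1][1] * diff[1]]

def project(points, w):
    """Project each 2-D point onto the discriminant axis w."""
    return [p[0] * w[0] + p[1] * w[1] for p in points]

# Hypothetical 2-D feature points for the two classes.
non_crack = [[1.0, 2.0], [2.0, 3.0], [3.0, 3.0], [2.0, 1.0]]
crack = [[6.0, 5.0], [7.0, 8.0], [8.0, 6.0], [7.0, 5.0]]
w = fisher_direction(non_crack, crack)
z0, z1 = project(non_crack, w), project(crack, w)
```

On this toy data the projected values of the two classes do not overlap. With only two classes, LDA yields a single discriminant axis, so each sample reduces to one number.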
From the final optimized features of a concrete surface image, the eXtreme gradient boosting (XGB) classifier determines whether the image shows a crack. This research selects XGB because it gave the best outcomes among the four classifiers analyzed: XGB [30], random forest (RF) [31], adaptive boosting (AdaBoost) [32], and category boosting (CatBoost) [33]. The XGB classifier builds an ensemble of decision trees sequentially using gradient boosting, in which each new tree tries to correct the errors made by the previous trees. At each step, the algorithm computes the gradient of the loss function with respect to the current predictions; for a squared-error loss, this is the difference between the predicted and actual values. The new tree is then fit to the negative gradient, so adding it to the ensemble reduces the loss. XGB also uses regularization to prevent overfitting, which means it penalizes more complex models to keep them simple. Additionally, it includes tree pruning and built-in handling of missing values. These features make XGB efficient and accurate in classification tasks.
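The sequential correction of errors can be illustrated with a bare-bones gradient boosting loop. This sketch uses one-split decision stumps, a squared-error loss, and a single hypothetical feature; it is plain gradient boosting rather than XGBoost itself, which additionally uses second-order gradient information, regularization, and pruning.

```python
def fit_stump(xs, residuals):
    """Best single-threshold split on a 1-D feature, minimizing squared error."""
    best = None
    for t in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        err = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
        if best is None or err < best[0]:
            best = (err, t, lm, rm)
    _, t, lm, rm = best
    return lambda x: lm if x <= t else rm

def boost(xs, ys, n_trees=20, lr=0.5):
    """Add stumps one at a time, each fit to the negative gradient of the loss."""
    base = sum(ys) / len(ys)
    trees, preds, losses = [], [base] * len(xs), []
    for _ in range(n_trees):
        # For the loss 1/2 (y - p)^2, the negative gradient is the residual y - p.
        residuals = [y - p for y, p in zip(ys, preds)]
        tree = fit_stump(xs, residuals)
        trees.append(tree)
        preds = [p + lr * tree(x) for p, x in zip(preds, xs)]
        losses.append(sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys))
    return (lambda x: base + lr * sum(t(x) for t in trees)), losses

# Hypothetical 1-D feature with binary crack (1) / non-crack (0) labels.
xs = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9]
ys = [0, 0, 0, 1, 0, 1, 1, 1]
model, losses = boost(xs, ys)
scores = [model(x) for x in xs]
```

The `losses` list records the training error after each added stump; it does not increase from one round to the next, which reflects each tree correcting part of what the previous ones left unexplained.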
To make the classification outcome understandable, this research uses two explainers: LIME and Grad-CAM++. LIME explains a prediction by approximating the model locally with a simpler, interpretable model. It perturbs the input data and observes the resulting changes in the output. It then fits a linear model around the prediction to explain it. LIME thus indicates which features are most important for a specific prediction [34]. Grad-CAM++ generates heatmaps that show which parts of an image influence the model’s decision. It calculates the gradients of the target class score with respect to the feature maps. It then combines these gradients to produce a weighted map. This map highlights the regions of the image that are important for the prediction. Grad-CAM++ improves on Grad-CAM by better handling multiple instances of the target object in an image, and it provides more precise and detailed visual explanations [35]. Together, LIME and Grad-CAM++ help to interpret and visualize how decisions are made in concrete surface crack recognition.
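The gradient-weighted map can be sketched with the simpler Grad-CAM baseline, in which each feature map is weighted by the spatial mean of its gradients. Grad-CAM++ replaces this uniform average with pixel-wise weights on the positive gradients, which is not shown here. The feature maps and gradients in the example are made-up numbers, not outputs of a trained network.

```python
def grad_cam(feature_maps, gradients):
    """Grad-CAM heatmap: ReLU of the gradient-weighted sum of the feature maps."""
    h, w = len(feature_maps[0]), len(feature_maps[0][0])
    # One weight per channel: the spatial mean of that channel's gradients.
    alphas = [sum(map(sum, g)) / (h * w) for g in gradients]
    heat = [[0.0] * w for _ in range(h)]
    for alpha, fmap in zip(alphas, feature_maps):
        for y in range(h):
            for x in range(w):
                heat[y][x] += alpha * fmap[y][x]
    # ReLU keeps only the regions that push the class score up.
    return [[max(0.0, v) for v in row] for row in heat]

# Two hypothetical 2 x 2 feature maps and the class-score gradients w.r.t. them.
feature_maps = [[[1.0, 0.0], [0.0, 0.0]],
                [[0.0, 0.0], [0.0, 1.0]]]
gradients = [[[0.5, 0.5], [0.5, 0.5]],
             [[-1.0, -1.0], [-1.0, -1.0]]]
heatmap = grad_cam(feature_maps, gradients)
```

Here the first channel has a positive weight and lights up the top-left cell, while the second channel has a negative weight and is suppressed by the ReLU.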
To localize and identify the exact crack region, this research develops an algorithm, which is presented in Algorithm 1. The algorithm begins by converting the input image I to grayscale and then to a binary image B using a specified threshold T. Morphological operations are applied to B to enhance crack regions, giving the processed binary image M. Contours are then detected in M. For each contour, its convex hull Hi is computed. A mask H is generated to isolate crack regions by combining all convex hulls. This mask H is used to extract crack regions R from the original image I. To compute the convex hull from contours, Graham’s scan [36] is used, which is a widely used convex hull technique. It begins by selecting the point with the lowest y-coordinate (and the leftmost if tied) as the starting point. It then sorts all other points by their polar angle relative to this point. Using a stack, it iteratively adds points to form the convex hull, ensuring that each new point does not create a clockwise turn with the last two points on the stack, until all points are processed. In Algorithm 1, the percentage of the image area covered by cracks (Pcrack) is calculated by comparing the area of H to the total area of I. Finally, Pcrack is returned as the output, providing a quantitative measure of crack presence in the image.
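The Graham's scan procedure described above can be sketched as follows. The polar-angle ordering is implemented with an exact cross-product comparison instead of computing angles, and points that are collinear with the current hull edge are dropped; both are implementation choices of this sketch, and the sample contour is hypothetical.

```python
from functools import cmp_to_key

def cross(o, a, b):
    """Z-component of (a - o) x (b - o): positive for a counter-clockwise turn."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

def graham_scan(points):
    """Convex hull of 2-D points, returned counter-clockwise starting at the pivot."""
    pts = sorted(set(points), key=lambda p: (p[1], p[0]))
    if len(pts) < 3:
        return pts
    pivot = pts[0]  # lowest y-coordinate, leftmost on ties

    def dist2(p):
        return (p[0] - pivot[0]) ** 2 + (p[1] - pivot[1]) ** 2

    def by_polar_angle(a, b):
        turn = cross(pivot, a, b)
        if turn != 0:
            return -1 if turn > 0 else 1  # smaller polar angle comes first
        return dist2(a) - dist2(b)        # same angle: nearer point first

    hull = [pivot]
    for p in sorted(pts[1:], key=cmp_to_key(by_polar_angle)):
        # Pop while the last two stacked points and p turn clockwise or are collinear.
        while len(hull) >= 2 and cross(hull[-2], hull[-1], p) <= 0:
            hull.pop()
        hull.append(p)
    return hull

# Hypothetical contour: four rectangle corners plus interior and edge points.
contour = [(0, 0), (4, 0), (4, 3), (0, 3), (2, 1), (1, 2), (2, 0)]
hull = graham_scan(contour)
```

For this contour only the four rectangle corners remain on the stack, since the interior points and the point on the bottom edge are popped during the scan.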
Algorithm 1. Crack region isolation using convex hulls
Input: image I; threshold T
Output: Pcrack, the percentage of the image area covered by cracks
Start
1. Convert Image to Grayscale: Let G represent the grayscale image obtained from I.
2. Apply Binary Thresholding: Define a binary image B in which each pixel B(x, y) is set to 0 or 1 by comparing G(x, y) with the threshold T.
3. Enhance Cracks with Morphological Operations: Apply closing and opening operations to B to smooth the crack regions and fill gaps in them, giving the refined binary image M.
4. Detect Contours: Identify contours {C1, C2, …, Cn} in the refined binary image M.
5. Calculate Convex Hulls: For each contour Ci, calculate its convex hull Hi = ConvexHull(Ci).
6. Create Convex Hull Mask: Create a mask H in which a pixel is set if it belongs to one or more of the convex hulls H1, …, Hn.
7. Isolate Crack Regions: Generate an image R by masking I with H, so that R contains only the crack regions isolated from the original image I.
8. Calculate Crack Percentage: Determine the percentage of the image area covered by cracks: Pcrack = (Area(H) / Area(I)) × 100.
9. Output: Return Pcrack, representing the percentage of the image area covered by cracks.
End
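The steps of Algorithm 1 can be sketched end to end on a small grayscale grid. Several choices here are assumptions made for the example: pixels darker than T are treated as crack candidates, 4-connected components stand in for detected contours, the morphological step is omitted, the hull is computed with Andrew's monotone chain rather than Graham's scan to keep the code short, and areas are counted in pixels.

```python
def cross(o, a, b):
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

def convex_hull(points):
    """Andrew's monotone chain; hull vertices in counter-clockwise (x, y) order."""
    pts = sorted(set(points))
    if len(pts) < 3:
        return pts
    def half(seq):
        chain = []
        for p in seq:
            while len(chain) >= 2 and cross(chain[-2], chain[-1], p) <= 0:
                chain.pop()
            chain.append(p)
        return chain[:-1]
    return half(pts) + half(reversed(pts))

def components(binary):
    """Group foreground pixels into 4-connected components (stand-in for contours)."""
    h, w = len(binary), len(binary[0])
    seen, groups = set(), []
    for y in range(h):
        for x in range(w):
            if binary[y][x] and (x, y) not in seen:
                stack, group = [(x, y)], []
                seen.add((x, y))
                while stack:
                    cx, cy = stack.pop()
                    group.append((cx, cy))
                    for nx, ny in ((cx + 1, cy), (cx - 1, cy), (cx, cy + 1), (cx, cy - 1)):
                        inside = 0 <= nx < w and 0 <= ny < h
                        if inside and binary[ny][nx] and (nx, ny) not in seen:
                            seen.add((nx, ny))
                            stack.append((nx, ny))
                groups.append(group)
    return groups

def crack_percentage(gray, threshold):
    h, w = len(gray), len(gray[0])
    # Step 2 (assumed polarity): pixels darker than the threshold are crack candidates.
    binary = [[1 if v < threshold else 0 for v in row] for row in gray]
    # Step 3 (closing and opening) is omitted in this sketch.
    mask = [[0] * w for _ in range(h)]
    for group in components(binary):            # step 4
        hull = convex_hull(group)               # step 5
        for x, y in group:
            mask[y][x] = 1
        if len(hull) >= 3:                      # step 6: fill the hull interior
            edges = list(zip(hull, hull[1:] + hull[:1]))
            for y in range(h):
                for x in range(w):
                    if all(cross(a, b, (x, y)) >= 0 for a, b in edges):
                        mask[y][x] = 1
    # Step 7: keep original gray levels inside the mask, zero elsewhere.
    region = [[g if m else 0 for g, m in zip(grow, mrow)]
              for grow, mrow in zip(gray, mask)]
    p_crack = 100.0 * sum(map(sum, mask)) / (h * w)  # step 8
    return p_crack, mask, region

# 6 x 6 toy image: bright surface (200) with a dark L-shaped crack (50).
gray = [[200] * 6 for _ in range(6)]
for x, y in [(1, 1), (1, 2), (1, 3), (2, 3), (3, 3)]:
    gray[y][x] = 50
p_crack, mask, region = crack_percentage(gray, threshold=100)
```

In this toy image the crack covers five pixels, and its convex hull also encloses the pixel at (2, 2), so the mask contains six of the 36 pixels. The example shows that the hull-based percentage is an upper estimate of the thresholded crack area for non-convex cracks.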