Article

Binocular Video-Based Automatic Pixel-Level Crack Detection and Quantification Using Deep Convolutional Neural Networks for Concrete Structures

1 State Key Laboratory of Building Safety and Built Environment, Beijing 100013, China
2 China Academy of Building Research Co., Ltd., Beijing 100013, China
3 Beijing Higher Institution Engineering Research Center of Structural Engineering and New Materials, Beijing University of Civil Engineering and Architecture, Beijing 100044, China
4 Beijing Advanced Engineering Research Center for Future Urban Design, Beijing University of Civil Engineering and Architecture, Beijing 100044, China
* Author to whom correspondence should be addressed.
Buildings 2025, 15(2), 258; https://doi.org/10.3390/buildings15020258
Submission received: 13 December 2024 / Revised: 10 January 2025 / Accepted: 14 January 2025 / Published: 17 January 2025
(This article belongs to the Special Issue Seismic Performance and Durability of Engineering Structures)

Abstract

Crack detection and quantification play crucial roles in assessing the condition of concrete structures. Herein, a novel real-time crack detection and quantification method that leverages binocular vision and a lightweight deep learning model is proposed. The method comprises four modules: a lightweight classification algorithm, a high-precision segmentation algorithm, the semi-global block matching (SGBM) algorithm, and a crack quantification technique. Based on the crack segmentation results, a framework is developed for pixel-level quantitative analysis of the major geometric parameters, including crack length, crack width, and crack angle of orientation. Results indicate that incorporating channel attention and spatial attention mechanisms in the MBConv module increases the detection accuracy of the improved EfficientNetV2 by 1.6% compared with the original EfficientNetV2, and that the proposed quantification method achieves low errors of 2%, 4.5%, and 4% for the crack length, width, and angle of orientation, respectively. The proposed method can be deployed on smart devices to support crack detection and quantification in practical use.

1. Introduction

As the primary load-bearing systems in modern buildings, concrete structures are extensively employed in bridges, high-rise buildings, dams, tunnels, and various other types of infrastructure. However, these structures are susceptible to cracking due to various factors, including material properties, construction techniques, environmental conditions, and applied loads. Cracks not only compromise the appearance and functionality of the structure but also reduce its bearing capacity, posing significant safety risks. Therefore, investigating crack detection in concrete structures holds substantial practical significance and engineering application value [1,2,3,4,5,6,7,8].
Visual inspection, a prevalent method for crack detection, is highly subjective and frequently yields inconsistent and inaccurate results due to human factors. To address the limitations of manual inspection, researchers have proposed various automated methods for crack detection and evaluation. For example, infrared thermal imaging [9,10] has been applied in engineering inspection; however, although it enables automated detection, it still requires manual intervention during processing, which complicates the overall procedure. Ground-penetrating radar [11,12] can detect internal cracks in concrete by reflecting electromagnetic signals at crack locations, but its effectiveness in detecting surface cracks remains limited.
In recent years, advancements in artificial intelligence have significantly enhanced detection efficiency and quality in the construction field [7,8,13,14,15,16,17,18,19,20,21]. Computer vision-based crack detection technology has gradually emerged as a research hotspot due to its automation and real-time processing capabilities. The identification of concrete structures’ cracks primarily employs target detection and image classification methods [22,23,24,25,26]. However, target detection methods often suffer from poor real-time performance due to their complex network architectures and large parameter sizes.
In the case of image classification methods, Chen et al. [27] proposed a deep learning framework (NB-CNN) that integrates convolutional neural networks (CNNs) with naïve Bayes data fusion to analyze cracks in individual video frames. However, its accuracy is limited by the small number of convolutional layers. Subsequent studies enhanced detection by increasing the number of convolutional layers; however, excessive convolutional layers may cause problems such as over-fitting and gradient vanishing, ultimately reducing accuracy [28,29].
As a result, researchers have explored various lightweight classification models for crack detection, aiming to reduce computational resource consumption while maintaining high recognition accuracy [30,31]. Traditional classification models require longer prediction times due to their deeper convolutional architectures. Conversely, lightweight classification models reduce computational parameters and inference time due to their simplified convolutional nature. However, their accuracy still requires improvement [32,33,34,35,36].
Quantifying cracks is critical to crack detection [7,17,37,38,39,40,41]. Guo et al. [42] proposed a method to estimate crack width by calculating the crack angle using a cosine function; however, this approach only provides the average width, resulting in discrepancies with localized widths. Flah et al. [7] developed an automated detection model combining image processing and deep learning techniques to identify cracks in inaccessible areas of concrete structures. By utilizing a Keras classifier combined with Otsu image processing, this method quantifies the end-to-end length of cracks but cannot determine the overall crack length. Peng et al. [43] introduced an automated detection and quantification method for bridge cracks that quantifies crack width by traversing segmented images and measuring the minimum distance between crack edges; although effective for cracks with relatively uniform widths, it incurs substantial quantification errors for cracks with non-uniform widths. Ding et al. [44] employed deep learning combined with unmanned aerial vehicle (UAV) technology to achieve crack detection and quantification without the need for reference markers, determining the crack length by extracting the crack skeleton. Yang et al. [45] introduced a computer vision-based method to quantify the dimensions of components and cracks by traversing two edge points of a crack and calculating the distance between these points to obtain the crack width; however, this method produces errors in the quantified crack width for certain cracks.
Currently, most crack length quantification methods merely describe the length by calculating the distance between two end points, which poses challenges for subsequent crack propagation analysis. Additionally, these methods often yield quantification errors in crack width, which makes it difficult to meet practical detection requirements. In response to these challenges, this paper develops crack identification and quantification methods for concrete structures to achieve higher precision and accuracy based on enhanced computer vision techniques.
Thus, the novelty of the present study consists of the following aspects:
(1)
An enhanced EfficientNetV2 model is developed to detect cracks with high accuracy.
(2)
Concrete cracks are quantified in terms of length, maximum width, and propagation angle. A crack feature extraction method based on the U-Net semantic segmentation model is adopted. Based on the crack features extracted from the semantic segmentation model, the geometric dimensions of the cracks are computed by measuring the target object with depth information obtained from a binocular camera.
(3)
Crack quantification is used for structural damage assessment. This paper proposes a new skeleton line-based algorithm for determining the maximum crack width. The algorithm applies the plumb line method based on the median axis to quantify the maximum crack width, which differs from existing approaches to maximum crack width estimation. Experiments show that the skeleton line-based algorithm reliably determines the maximum crack width.
The remainder of this paper is organized as follows: In Section 2, the improved EfficientNetV2 architecture and the quantitative approach are described. In Section 3, the results are presented, and the proposed method is fully validated against available experimental results of concrete crack measurements.

2. Methods

2.1. The Overall Process of Crack Detection

This study proposes an inspection process that focuses on detecting centimeter-sized and millimeter-sized cracks, which are critical for assessing a building’s safety after an earthquake. The proposed approach includes an enhanced lightweight classification model that instantly identifies cracks in concrete structures and the quantification of crack characteristics using digital image processing techniques. This improves identification accuracy and precision when measuring maximum crack width. As illustrated in Figure 1, the entire detection procedure consists of three steps.
First, a preliminary qualitative analysis of the cracks is conducted using the improved EfficientNetV2. The video frames captured by the left-eye camera are input into the lightweight crack classification algorithm proposed in this study, and each frame is classified. If the identification result of a frame contains cracks, the image is classified as a positive sample and selected for subsequent depth calculations; otherwise, it is classified as a negative sample and excluded from further computations. If the image captured by the left-eye camera is classified as a positive sample, it is combined with the image from the right-eye camera to calculate depth information. To match feature points in the images obtained from the binocular camera quickly and accurately, this study adopts the semi-global block matching (SGBM) algorithm [46], which assesses the similarity between corresponding pixel regions for feature matching. The algorithm achieves a good balance between computational efficiency and accuracy, rendering it well suited to this study, where disparity maps must be computed quickly. Subsequently, based on triangulation combined with the known camera focal length and baseline distance, the disparity values are converted into actual physical distances to obtain depth information about the cracks. To improve accuracy, depth values at multiple points on the depth map are extracted and averaged to obtain the distance to the crack plane.
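As an illustration of this first step, the following minimal Python sketch shows how a disparity map produced by OpenCV's SGBM implementation can be converted into an average crack-plane depth via triangulation (Z = f·b/disparity). The function name, parameter values, and the assumption of already rectified grayscale frames are illustrative choices, not the exact implementation used in this study.

import cv2
import numpy as np

def crack_plane_depth(left_gray, right_gray, focal_px, baseline_m, sample_points):
    # SGBM matcher; all parameter values below are illustrative
    sgbm = cv2.StereoSGBM_create(
        minDisparity=0,
        numDisparities=128,      # must be a multiple of 16
        blockSize=5,
        P1=8 * 5 * 5,            # smoothness penalty for small disparity changes
        P2=32 * 5 * 5,           # smoothness penalty for large disparity changes
        uniquenessRatio=10,
        speckleWindowSize=100,
        speckleRange=2,
    )
    # OpenCV returns fixed-point disparities scaled by 16
    disparity = sgbm.compute(left_gray, right_gray).astype(np.float32) / 16.0
    depths = []
    for (u, v) in sample_points:                 # pixel locations on the crack region
        d = disparity[v, u]
        if d > 0:                                # skip unmatched pixels
            depths.append(focal_px * baseline_m / d)   # Z = f * b / disparity
    return float(np.mean(depths))                # average depth of the crack plane (Z0)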
Secondly, feature extraction is applied to the keyframe images containing crack information. A segmented mask image is generated by separating the cracks from the surrounding area using image segmentation techniques. The Otsu method [47], which achieves image segmentation by comparing pixel intensity with a threshold value, is typically used in traditional segmentation techniques. However, this approach exhibits limited segmentation effectiveness in crack images, and it is highly susceptible to noise. In contrast, the U-Net segmentation network employed in this study learns from a large number of samples, automatically extracts crack features, and realizes crack recognition and extraction on this basis.
Thirdly, based on the algorithms for crack quantification proposed in this study, the geometric features of the cracks, such as crack length, maximum width, and propagation angle, are extracted from the binarized mask image, preparing for automated quantitative assessment without the need for manual measurement.
Figure 1. Flow chart of the whole process of crack detection [48].

2.2. Improved Lightweight Crack Classification and Identification Algorithm

2.2.1. Lightweight Crack Classification and Identification Algorithm

When selecting the crack classification algorithm, this study requires the model to exhibit fast inference speed so that each captured video frame can be processed in a timely manner. EfficientNetV2 [49] is a deep neural network classification model that utilizes a compound scaling strategy and optimized convolutional operations, significantly enhancing processing speed while maintaining high accuracy. The EfficientNetV2 network model is therefore selected as the lightweight classification model to implement the proposed detection process. The Fused-MBConv module is adopted instead of the MBConv module in the shallow stages. The Fused-MBConv module replaces the 3 × 3 depthwise convolution and the 1 × 1 convolution in MBConv with a conventional 3 × 3 convolution, as illustrated in Figure 2. The Fused-MBConv module more efficiently leverages server-side accelerators, enabling faster training and inference. For EfficientNetV2, the size of the training and inference images and the number of parameters both decrease compared with those of EfficientNetV1 [50].
The EfficientNetV2 network structure is presented in Table 1, where “Conv 3 × 3” denotes a convolution operation with a kernel size of 3 × 3 and a stride of 2. Numbers like “1, 4” following the Fused-MBConv module represent the output channel expansion ratio of the Expand Conv layer. “k3 × 3” indicates a convolution kernel size of 3 × 3. Similarly, numbers like “4, 6” following the MBConv module indicate the output channel expansion ratio of the Expand Conv layer, where “k3 × 3” represents a convolution kernel size of 3 × 3. “SE 0.25” indicates that the number of nodes in the first fully connected layer within the SE module is one-fourth of the number of feature matrix channels in the MBConv module. Finally, the output is generated through a 1 × 1 convolution, a pooling layer, and a fully connected layer.
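To make the module substitution concrete, the sketch below shows a Fused-MBConv block in which a single conventional 3 × 3 convolution takes the place of the 1 × 1 expansion convolution and 3 × 3 depthwise convolution of MBConv. The paper does not specify a deep learning framework, so PyTorch is assumed here, and the layer choices (SiLU activation, batch normalization, residual shortcut) are common defaults rather than the exact configuration used by the authors.

import torch.nn as nn

class FusedMBConv(nn.Module):
    def __init__(self, in_ch, out_ch, expand_ratio=4, stride=1):
        super().__init__()
        mid = in_ch * expand_ratio
        self.block = nn.Sequential(
            # a single regular 3x3 conv does both channel expansion and spatial mixing
            nn.Conv2d(in_ch, mid, kernel_size=3, stride=stride, padding=1, bias=False),
            nn.BatchNorm2d(mid),
            nn.SiLU(),
            nn.Conv2d(mid, out_ch, kernel_size=1, bias=False),  # project back to out_ch
            nn.BatchNorm2d(out_ch),
        )
        self.use_skip = stride == 1 and in_ch == out_ch

    def forward(self, x):
        out = self.block(x)
        return x + out if self.use_skip else out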
Existing research indicates that although the EfficientNetV2 model improves training and inference speed by reducing the number of parameters, its accuracy remains slightly lower than that of the ResNet-RS model [49]. Since the method proposed in this paper relies on the classification results to guide subsequent processing steps, detection accuracy is crucial for enhancing the robustness of the approach. Based on this, this study enhances the MBConv module of the EfficientNetV2 network to improve detection accuracy while maintaining inference speed.

2.2.2. Improved MBConv Module

Experiments on the publicly available complex road crack dataset Crack500 have demonstrated that networks combining dual attention mechanisms (channel attention and spatial attention) achieve higher accuracy compared to traditional deep neural networks. Notably, the spatial attention mechanism significantly enhances crack detection accuracy [51]. The squeeze-and-excitation (SE) channel attention mechanism within the MBConv module of the EfficientNetV2 network consists of two components: squeeze and excitation, as illustrated in Figure 3. This module aims to automatically learn the importance of each feature channel by constructing interdependencies among channels and employing a feature reweighting strategy to enhance relevant features and suppress irrelevant ones.
To capture richer channel information, this study replaces the SE channel attention module in the original model with the CAM channel attention module. This module adopts both global average pooling and global maximum pooling in the feature compression process, as illustrated in Figure 4a.
Additionally, this study introduces the SAM spatial attention module, as depicted in Figure 4b. Since the crack region is typically darker than the background in crack images of concrete structures, it can be distinguished more effectively by contrasting it with the background. Moreover, cracks are typically linearly distributed on concrete structures. The feature map therefore needs to be weighted so that higher attention is given to crack regions in the spatial dimension. After introducing the spatial attention mechanism, ineffective computations in non-crack regions are reduced.
As shown in Figure 4, the improvement provides a more efficient training and inference process, achieving a better balance between performance and resource consumption. Thus, it is particularly suitable for real-time application scenarios.
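The following sketch illustrates one possible form of the dual attention described above: a channel attention module (CAM) that combines global average pooling and global max pooling, and a spatial attention module (SAM) that weights crack-like regions in the spatial dimension. PyTorch is assumed, and the reduction ratio and 7 × 7 kernel size are illustrative assumptions rather than the exact settings of the improved MBConv module.

import torch
import torch.nn as nn
import torch.nn.functional as F

class ChannelAttention(nn.Module):
    def __init__(self, channels, reduction=4):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Conv2d(channels, channels // reduction, 1, bias=False),
            nn.SiLU(),
            nn.Conv2d(channels // reduction, channels, 1, bias=False),
        )

    def forward(self, x):
        avg = self.mlp(F.adaptive_avg_pool2d(x, 1))   # global average pooling branch
        mx = self.mlp(F.adaptive_max_pool2d(x, 1))    # global max pooling branch
        return x * torch.sigmoid(avg + mx)

class SpatialAttention(nn.Module):
    def __init__(self, kernel_size=7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2, bias=False)

    def forward(self, x):
        avg = x.mean(dim=1, keepdim=True)             # per-pixel channel average
        mx, _ = x.max(dim=1, keepdim=True)            # per-pixel channel maximum
        weight = torch.sigmoid(self.conv(torch.cat([avg, mx], dim=1)))
        return x * weight                             # emphasize crack-like regions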

2.3. High-Precision Crack Segmentation Method

U-Net, a convolutional neural network architecture for image segmentation, features an encoder–decoder structure with skip connections to preserve spatial details. To obtain more accurate pixel-level information on cracks, this study adopts a crack feature extraction method based on the U-Net semantic segmentation model. The method conducts pixel-level segmentation of crack images using a backbone feature extraction network combined with a reinforcement feature extraction network, accurately separating crack regions from background regions, and ultimately produces a segmented image for the quantification of cracks in concrete structures. U-Net effectively enhances feature preservation through skip connections, retaining fine spatial information from early layers, which is crucial for accurately detecting elongated cracks and ensuring the continuity and integrity of crack features. Furthermore, its symmetric encoder–decoder architecture, combined with skip connections, facilitates efficient context aggregation, enabling local and global information to be captured simultaneously; this improves overall performance and is particularly beneficial for detecting elongated cracks. The model architecture is illustrated in Figure 5. U-Net integrates high-level semantic information with low-level details through skip connections to significantly enhance crack detection accuracy.
The U-Net architecture is divided into three components. The first component is the trunk feature extraction section. The backbone section has an input image size of (512 × 512 × 3). The first step and the second step each perform two convolution operations with a kernel size of 3 × 3, producing feature maps of sizes (512 × 512 × 64) and (256 × 256 × 128), respectively. After each step, a 2 × 2 max-pooling operation is applied, resulting in feature maps of sizes (256 × 256 × 64) and (128 × 128 × 128). The third step and the fourth step each perform two convolution operations with a kernel size of 3 × 3, producing feature maps with 256 and 512 channels of sizes (128 × 128 × 256) and (64 × 64 × 512), respectively. Similarly, each feature map undergoes 2 × 2 max-pooling after convolution, resulting in pooled feature maps of sizes (64 × 64 × 256) and (32 × 32 × 512), respectively. Finally, the fifth step performs two convolution operations with a kernel size of 3 × 3 and 512 channels, producing a feature map of size (32 × 32 × 512).
The second component is the enhanced feature extraction network, which strengthens feature extraction by fusing features from five preliminary layers in the backbone network. The sixth step generates a feature layer of (64 × 64 × 512) through up-sampling, which is channel-fused with the output layer from the fourth step to produce a feature map of (64 × 64 × 1024). The seventh step generates a feature layer of (128 × 128 × 512) through up-sampling, which is fused with the output layer from the third step to produce a feature map of (128 × 128 × 768). The eighth step generates a feature layer of (256 × 256 × 256) through up-sampling, which is fused with the output layer from the second step to produce a feature map of (256 × 256 × 384). The ninth step generates a feature layer of (512 × 512 × 128) through up-sampling, which is fused with the output layer from the first step to produce a feature map of (512 × 512 × 192). Finally, after two convolution operations, a feature map of size (512 × 512 × 64) is generated as the output of this section.
The third component is the prediction network, where a 1 × 1 convolution kernel is used to adjust the number of channels. Since this study involves a binary classification task, the final output feature map size is (512 × 512 × 2), which is used for accurately predicting the crack region.
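The compact sketch below captures the essential encoder–decoder pattern with skip connections described above, reduced to two encoder steps for brevity (the actual model uses five). PyTorch is assumed, the channel counts mirror the first two steps of the architecture, and the class/function names are illustrative; it is a structural illustration rather than the full model.

import torch
import torch.nn as nn

def double_conv(in_ch, out_ch):
    # two successive 3x3 convolutions, as in each U-Net step
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True),
        nn.Conv2d(out_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True),
    )

class TinyUNet(nn.Module):
    def __init__(self, num_classes=2):
        super().__init__()
        self.enc1 = double_conv(3, 64)
        self.enc2 = double_conv(64, 128)
        self.pool = nn.MaxPool2d(2)
        self.up = nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False)
        self.dec1 = double_conv(128 + 64, 64)       # channel fusion after up-sampling
        self.head = nn.Conv2d(64, num_classes, 1)   # 1x1 conv adjusts the channel count

    def forward(self, x):
        f1 = self.enc1(x)                 # high-resolution, low-level features
        f2 = self.enc2(self.pool(f1))     # low-resolution, high-level features
        d1 = self.dec1(torch.cat([self.up(f2), f1], dim=1))   # skip connection
        return self.head(d1)              # per-pixel class logits (background/crack)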

2.4. Algorithm for Converting Pixel Values to Actual Distances

The crack geometry information extraction algorithm proposed in this study can quantify the crack length, maximum width, and propagation angle. The geometric dimensions of the cracks are computed by measuring the target object with a binocular camera, combined with the crack features extracted from the semantic segmentation model, and the overall system is shown in Figure 6a. The basic principle of binocular camera measurement is similar to the stereo vision of the human eye. Specifically, the triangulation principle of parallel view is adopted, as illustrated in Figure 6b. The target point Pi has the coordinates (Xc, Yc, Zc) in the left-eye camera coordinate system. The projection points on the left and right image planes are pl (ul, vl) and pr (ur, vr), respectively. Since the two image planes are at the same level (vl = vr = v), the depth of the spatial point Pi under the camera coordinate system is obtained by the principle of similar triangulation, as illustrated in Equation (1):
Z_c = \frac{b f}{u_l - u_r}
where Zc denotes the depth coordinate of the target point, b denotes the baseline length, i.e., the spacing between the binocular camera optical centers Ocl and Ocr, f denotes the focal length of the camera, and ul − ur denotes the binocular disparity. The crack features are extracted from the semantic segmentation model; the distance d between two points in the pixel coordinate system can then be calculated by geometric feature extraction based on the binarized image.
To convert the pixel dimensions of the cracks in the image into actual physical dimensions, camera parameters must be obtained through binocular camera calibration. The obtained camera parameters, combined with the pixel-to-actual value conversion formula proposed in this study, are used to calculate the actual physical distance D between two points in the camera coordinate system. The coordinate conversion process is illustrated in Figure 6c.
There are rotation and translation transformations between the world coordinate system (OwXwYwZw) and the camera coordinate system (OcXcYcZc). Such coordinate transformations do not change the distance value D in the camera coordinate system, so the conversion from the world coordinate system to the camera coordinate system need not be considered, and the world coordinate system can be set to coincide with the camera coordinate system. The transformation from the camera coordinate system to the image coordinate system (O1xy) is realized by perspective projection, i.e., projecting a 3D point Pi (Xc, Yc, Zc) onto a point pi (x, y) in the 2D imaging plane through a pinhole imaging model, as shown in Equation (2):
\begin{bmatrix} x \\ y \\ 1 \end{bmatrix} = \frac{1}{Z_c} \begin{bmatrix} f & 0 & 0 \\ 0 & f & 0 \\ 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} X_c \\ Y_c \\ Z_c \end{bmatrix}
where x and y denote the coordinates of the point in the image coordinate system, and the principal point O1 is the intersection of the optical axis and the image plane. Xc, Yc, and Zc denote the coordinates of the point in the camera coordinate system, Oc is the optical center of the lens, and f is the focal length of the camera.
The image coordinate system and the pixel coordinate system (O2-uv) are both located in the imaging plane; only the position of the origin and the unit of measurement differ. The conversion from the image coordinate system to the pixel coordinate system converts the physical units on the image into pixel units. The principal point of the image coordinate system usually does not coincide with the origin of the pixel coordinate system; its coordinates in the pixel coordinate system are taken as (cx, cy). Combined with the derivation of Equation (2), the conversion between the pixel coordinate system and the camera coordinate system is established, as illustrated in Equation (3):
\begin{bmatrix} u \\ v \\ 1 \end{bmatrix} = \begin{bmatrix} \alpha & 0 & c_x \\ 0 & \beta & c_y \\ 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} x \\ y \\ 1 \end{bmatrix} = \frac{1}{Z_c} \begin{bmatrix} f_x & 0 & c_x \\ 0 & f_y & c_y \\ 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} X_c \\ Y_c \\ Z_c \end{bmatrix}
where u and v denote the coordinates of the image pixel point in the pixel coordinate system, whose origin O2 is usually located at the upper left corner of the image. α and β denote the scale factors in the u-axis and v-axis directions, which are related to the physical dimensions of the sensor elements. fx and fy denote the physical focal length f (in millimeters) expressed in pixels along the u-axis and v-axis, respectively, i.e., fx = αf, fy = βf.
In the process of capturing cracks, the camera’s viewing angle is typically maintained at a perpendicular orientation to the crack plane. Under this configuration, the depth values Zc of each point on the crack plane are approximately equal in the camera coordinate system. To reduce the amount of algorithm calculation, this paper adopts the weak perspective camera model to replace the actual perspective model. Under this method, the distance between each point and the camera optical center when the object is projected is replaced by the average distance of the object. That is to say, the depths of the pixel points in the key frames are replaced by the average depth Z0, which is obtained by calculating the average of the depths of the binocular camera at multiple target points Zi. Equation (4) provides an approximate conversion relationship between pixel coordinates and camera coordinates under the weak perspective camera model:
u = f_x \frac{X}{Z_0} + c_x, \quad v = f_y \frac{Y}{Z_0} + c_y, \quad Z_0 = \frac{1}{n} \sum_{i=1}^{n} Z_i
where X and Y are the point coordinates under the camera coordinate system. Zi denotes the depth value of each target point. Z0 denotes the average depth value of the crack plane. n is the number of target points.
The distance between two coordinate points under the pixel coordinate system and the distance between two coordinate points under the camera coordinate system are shown in Equations (5) and (6), respectively:
d = \sqrt{(u_2 - u_1)^2 + (v_2 - v_1)^2}
where d denotes the distance of two coordinate points under the pixel coordinate system, and (v1, u1), (v2, u2) denote the coordinates of two points under the pixel coordinate system, respectively;
D = \sqrt{(X_2 - X_1)^2 + (Y_2 - Y_1)^2 + (Z_2 - Z_1)^2}
where D denotes the distance between two coordinate points in the camera coordinate system, and (X1, Y1, Z1), (X2, Y2, Z2) denote the coordinates of the two points in the camera coordinate system, respectively. Since changes in distance along the depth direction are not considered, i.e., Z1 = Z2 = Z0, the terms Z1 and Z2 cancel in Equation (6).
According to Equations (4)–(6), the conversion formula between pixel distance and actual distance in the crack quantification process can be derived, as shown in Equation (7). Substituting the calibrated camera parameters fx and fy into the formula gives the actual distance between two pixel points in each image frame, which provides the calculation method for the subsequent quantification of crack length and maximum width. In this formula, a coordinate transformation is first performed by projecting the line segment in the pixel coordinate system onto the u-axis and v-axis directions; the pixel lengths of the projected lines are then converted to physical dimensions in the camera coordinate system, and the actual physical distance is finally calculated by the Pythagorean theorem.
D = Z_0 \sqrt{\frac{p_x^2}{f_x^2} + \frac{p_y^2}{f_y^2}}
where px and py denote the projection lengths between the two points along the u-axis and v-axis directions in the pixel coordinate system, i.e., p_x = |u_2 − u_1| and p_y = |v_2 − v_1|, respectively.
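A direct reading of Equation (7) can be sketched in a few lines of Python; the function name and the worked numbers below are illustrative, with fx and fy taken from the camera calibration and Z0 from the averaged disparity-based depth.

import math

def pixel_to_physical_distance(p1, p2, fx, fy, z0):
    # p1, p2: (u, v) pixel coordinates of the two points
    px = abs(p2[0] - p1[0])        # projection length along the u-axis (pixels)
    py = abs(p2[1] - p1[1])        # projection length along the v-axis (pixels)
    return z0 * math.sqrt((px / fx) ** 2 + (py / fy) ** 2)

# Example (assumed values): points 30 px apart horizontally and 40 px vertically,
# fx = fy = 800 px, Z0 = 0.5 m  ->  about 0.031 m (31 mm)
# print(pixel_to_physical_distance((100, 100), (130, 140), 800.0, 800.0, 0.5))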

2.5. Algorithm for Crack Length Quantification

The skeleton line-based algorithm in crack quantification refers to a method that analyzes and quantifies cracks by extracting the crack’s skeleton line (the central line of the crack). In this study, crack length is defined as the length of the crack skeleton line, calculated by summing the lengths of each pixel along the skeleton. The segmented image is first binarized. The segmented image is a three-channel color image, where the crack region is displayed in red and the background region in black. After binarization, the pixel value of the crack region is set to 0 (black), and the pixel value of the background region is set to 255 (white), as shown in Figure 7.
The crack skeleton line extraction method is implemented through the cv2.distanceTransform function [52] in the OpenCV library for Python. This function adopts the Euclidean distance (cv2.DIST_L2) as the distance type and a 5 × 5 mask (cv2.DIST_MASK_5) for the distance transformation. The distance transform operates on binary images, which can be considered to contain both target and background pixels in a two-dimensional space. It calculates the minimum distance from each target point to the background, converting the binary image into a grayscale image in which the grayscale value of each pixel equals the minimum distance from that target pixel to the nearest background pixel, as illustrated by the different numbers in Figure 8. Finally, each row of pixels is traversed and the point with the maximum distance value is extracted to generate the crack skeleton line. The length of each pixel cell on the skeleton line is determined based on the direction of the pixel cell, and the pixel distance is then converted to the actual physical length using Equation (7). For pixel units developing in the horizontal, vertical, and diagonal directions, the actual physical lengths are Z_0/f_x, Z_0/f_y, and Z_0\sqrt{1/f_x^2 + 1/f_y^2}, respectively, and the actual physical length of the crack is obtained by summing these pixel-unit lengths along the skeleton.
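The sketch below follows the skeleton-based length procedure under the mask convention described above. Note that cv2.distanceTransform measures the distance of non-zero pixels to the nearest zero pixel, so the crack region is first mapped to the white foreground; the row-wise ridge extraction and the per-step conversion Z0·sqrt((Δu/fx)² + (Δv/fy)²), which reduces to Z0/fx, Z0/fy, or Z0·sqrt(1/fx² + 1/fy²) for unit horizontal, vertical, or diagonal steps, are a simplified illustration rather than the exact implementation.

import cv2
import numpy as np

def crack_length(binary_mask, fx, fy, z0):
    # crack pixels (value 0 in the mask) become the non-zero foreground
    crack_fg = (binary_mask == 0).astype(np.uint8) * 255
    dist = cv2.distanceTransform(crack_fg, cv2.DIST_L2, cv2.DIST_MASK_5)

    # take the ridge of the distance map row by row as the skeleton line
    skeleton = []
    for row in range(dist.shape[0]):
        cols = np.where(crack_fg[row] > 0)[0]
        if cols.size:
            skeleton.append((row, cols[np.argmax(dist[row, cols])]))

    # sum the physical length of each step along the skeleton (Equation (7) per step)
    length = 0.0
    for (r0, c0), (r1, c1) in zip(skeleton[:-1], skeleton[1:]):
        du, dv = abs(c1 - c0), abs(r1 - r0)
        length += z0 * np.hypot(du / fx, dv / fy)
    return length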

2.6. Algorithm for Quantifying Maximum Crack Width

Existing research on the maximum crack width is mostly based on the edge-line minimum distance method. In this method, every point on the skeleton line is traversed and the distances from that point to all edge points of the crack are calculated; twice the minimum distance is then taken as the crack width at that point.
The specific process of the minimum distance method is as follows. First, the set B of crack edge points is generated: the semantically segmented crack image is traversed, and if a pixel is a crack pixel while the pixels around it are not all crack pixels, that pixel is an edge point, as shown in Figure 9a; its coordinates are stored in set B. Second, leveraging the crack skeleton line S extracted as described in Section 2.5, each point Sj on the skeleton line S is traversed and its distance x to all edge points is calculated, as shown in Figure 9b. The minimum of all these distances, xmin, is found, and twice this value is the crack width X at that point; the value is stored in the set T. Finally, the maximum of all width values X in the set T is taken as the maximum crack width dmax.
However, this method has certain limitations in terms of quantization accuracy. For cracks with abrupt width changes, similar to those shown in Figure 9c, the quantified maximum width is generally smaller, as illustrated in Figure 9d.
To address this problem, the crack width quantification in this study uses the plumb line method based on the median axis to quantify the maximum crack width. The process is as follows. First, the edge points are obtained in the same way as above, giving the set of edge points B, and the crack skeleton line is taken as S. Next, each edge point Bj is traversed and the distances x between Bj and the points Sj on the skeleton line S are computed, as shown in Figure 10. The minimum of these distances, xmin, which corresponds to the distance along the plumb line from the skeleton point to the edge point, is calculated. Multiplying this minimum by 2 gives the crack width X at that skeleton point based on the median-axis plumb line. These values are stored in the set T, the maximum width dmax is derived from T, and the coordinates of the two points corresponding to the maximum width are recorded.
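A minimal sketch of the median-axis plumb line idea is given below: for each edge point, the shortest distance to the skeleton line is doubled to give a local width, and the largest of these values is reported as the maximum width (in pixels, to be converted afterwards with Equation (7)). The brute-force nearest-point search and the function interface are illustrative assumptions.

import numpy as np

def max_width_plumb_line(edge_points, skeleton_points):
    edges = np.asarray(edge_points, dtype=np.float64)      # set B, shape (Ne, 2)
    skel = np.asarray(skeleton_points, dtype=np.float64)   # skeleton S, shape (Ns, 2)
    widths = []
    for b in edges:
        d = np.linalg.norm(skel - b, axis=1)   # distances from this edge point to the skeleton
        widths.append(2.0 * d.min())           # local width at this edge point
    i = int(np.argmax(widths))
    return widths[i], tuple(edges[i])          # maximum width and the edge point attaining it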

2.7. Crack Angle Quantization Algorithm

In this study, a line is defined from the crack initiation point to the termination point, and the angle between this line and the horizontal is defined as the crack angle, as illustrated in Figure 11.
All pixel points in the segmented crack image are iterated and their coordinates are stored in the set L. The first coordinate in the set L is the coordinate of the starting point of the crack, and the last coordinate is the coordinate of the end point. Based on the coordinates of the start and end points, the angle value of the crack is calculated by the inverse tangent function, as shown in Equation (8):
x = L[-1][0] - L[0][0], \quad y = L[-1][1] - L[0][1], \quad \tan\theta = y / x, \quad \theta = \arctan(y/x) \ \text{(converted to degrees)}
where x denotes the length of the crack along the x-axis. y denotes the length of the crack along the y-axis. θ represents the angle between the crack and the horizontal line.
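Equation (8) translates almost directly into Python; the snippet below is a hedged sketch in which math.atan2 is used in place of the arctangent of y/x so that vertical cracks (x = 0) are also handled.

import math

def crack_angle(L):
    # L: list of (x, y) crack pixel coordinates; first entry is the start point,
    # last entry is the end point
    dx = L[-1][0] - L[0][0]
    dy = L[-1][1] - L[0][1]
    return math.degrees(math.atan2(dy, dx))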

3. Results and Discussion

3.1. Experiments on an Improved Lightweight Crack Classification and Identification Algorithm

3.1.1. Crack Classification Dataset Selection

The classification model in this paper is trained using the Concrete Crack Images for Classification dataset provided by Mendeley Data [53], which is used for binary crack/non-crack recognition and contains 20,000 crack images and 20,000 non-crack images. The dataset is divided into training and test sets in a ratio of 9:1, i.e., 36,000 images for the training set and 4000 for the test set, with an image size of 227 × 227 pixels.

3.1.2. Comparison of Hyperparameter Settings

Notably, all deep learning models in this study are trained on a laptop equipped with an NVIDIA GeForce RTX 4060 graphics card and 32 GB of RAM. In deep neural networks, gradient vanishing or gradient explosion often occurs during gradient descent as the number of layers increases, so model weight initialization is necessary. EfficientNetV2 is pre-trained using weights from the VOC2007 dataset, while the channel-attention and spatial-attention modules are initialized using the Xavier weight initialization method [54]. Training hyperparameters play a crucial role in the performance of deep neural networks. Different batch sizes, learning rates, and optimizers are compared to assess their impact on prediction accuracy and to achieve optimal network performance. With other parameters held constant, batch sizes of 1, 16, and 32 are compared for prediction accuracy. Both dynamic and static learning rates are tested: the static learning rate is set to 0.01, while the dynamic learning rate varies according to a cosine function, as shown in Equation (9). The SGD and Adam optimizers are also compared for prediction accuracy.
lf = \frac{1 + \cos(x \pi / 50)}{2} (1 - lr) + lr
where lf is the learning rate of change, x is the current number of iterations, and lr is the maximum learning rate.
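A short sketch of this cosine schedule is given below; how the resulting factor is applied to the base learning rate is an assumption of the example, since only the decay curve itself is specified above.

import math

def cosine_lr_factor(x, lr=0.01, epochs=50):
    # decays smoothly from 1 (at x = 0) toward lr (at x = epochs), per Equation (9)
    return (1 + math.cos(x * math.pi / epochs)) / 2 * (1 - lr) + lr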
In this study, the main reference index for the crack detection classification model for concrete structures is the accuracy rate, as shown in Equation (10), which is used as the criterion for crack detection. The accuracy rate is the number of correctly recognized images divided by the total number of images in the dataset.
Acc = \frac{num\_acc}{num\_sample}
where num_acc is the number of correctly recognized images, i.e., the network categorized the images correctly. num_sample is the number of all images.
Table 2 details the impact of different hyperparameter choices on the training performance on the crack classification dataset. When the batch size is set to 1, the model converges slowly; however, training time increases significantly when the batch size is too large. Therefore, a batch size of 16 is considered the most appropriate. The dynamic learning rate decreases according to a cosine function, gradually converging from 0.01 to 0. Although the static learning rate of 0.01 provides faster convergence, it is less effective at finding the optimal solution. The dynamic learning rate remains high in the early stages of training to accelerate convergence and gradually decreases in the later stages to improve model accuracy, making it easier to find the optimal solution. In this experiment, a significant difference is observed between the training results of the SGD and Adam optimizers: the model trained with the SGD optimizer achieves higher prediction accuracy, making it the preferred choice for subsequent training.

3.1.3. Ablation Experiment

Ablation experiments are used to validate the impact of key components of the model on overall system performance. Such experiments are performed by gradually removing or modifying specific parts of the model and observing the impact of these changes on training and prediction performance. Ablation experiments are therefore carried out to demonstrate the effectiveness of the improvements made to the EfficientNetV2 model. The original EfficientNetV2 network model and the improved EfficientNetV2 network model are trained under the same parameters, and the accuracy on the test set is used as the evaluation index; the test accuracies are shown in Table 3. The average accuracy of the improved EfficientNetV2 model improves by 1.6% (Top-5). Although the inference time increases slightly, the impact on actual crack detection is small.

3.1.4. Crack Detection Experiment

Crack images of concrete structure walls are captured and predicted using the original and improved versions of the EfficientNetV2 network model, respectively. In the generated test results, "Positive" indicates the presence of cracks, "Negative" indicates the absence of cracks, and "Prob:" indicates the predicted probability value for the category. The results in Figure 12a,b show that, for the same crack image, the improved EfficientNetV2 model yields a higher crack prediction probability than the original model. The results in Figure 12c,d show that, for more subtle crack images, the improved model successfully recognizes the cracks, while the original model fails to identify them correctly. The results show that including the dual attention mechanism in the EfficientNetV2 model enables it to predict crack images more accurately, and over several trials the probability values predicted by the improved EfficientNetV2 model are also higher than those of the original model. Therefore, the enhanced EfficientNetV2 model in this study can increase the accuracy of crack recognition.

3.2. Training Results of High-Precision Crack Segmentation Algorithm

The semantic segmentation model chosen for this study is the U-Net model. A total of 777 images are collected in the dataset, each with a resolution of 300 × 300 pixels. The segmentation model has two categories: background and crack. The resolution of the images input to the network is 512 × 512 pixels. A total of 100 epochs are trained, and the training results are shown in Figure 13. The training and testing losses show a smooth decreasing trend and stabilize after 80 epochs. MIoU is an evaluation index used to measure the segmentation effect, as shown in Equation (11). As shown in Figure 13b, the highest MIoU value is 84.3%, and it stabilizes after 60 epochs.
\mathrm{MIoU} = \frac{1}{k+1} \sum_{i=0}^{k} \frac{TP}{FN + FP + TP}
where k denotes the number of categories. TP denotes a pixel point that is cracked and predicted to be cracked. FP denotes a pixel point that is non-cracked and predicted to be cracked. FN denotes a pixel point that is cracked and predicted to be non-cracked.
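For the two-class (background/crack) case, Equation (11) can be computed as in the sketch below; pred and gt are assumed to be integer label maps of equal shape, and the handling of an absent class is an illustrative choice.

import numpy as np

def mean_iou(pred, gt, num_classes=2):
    ious = []
    for c in range(num_classes):
        tp = np.logical_and(pred == c, gt == c).sum()
        fp = np.logical_and(pred == c, gt != c).sum()
        fn = np.logical_and(pred != c, gt == c).sum()
        denom = tp + fp + fn
        ious.append(tp / denom if denom else 1.0)   # an empty class is counted as perfect
    return float(np.mean(ious))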
As shown in Figure 14, a comparison between the U-Net and Otsu methods is conducted. U-Net accurately identifies and segments cracks because it performs pixel-level analysis, whereas the Otsu method tends to introduce noise during segmentation due to its reliance on grayscale thresholding, which can misclassify bright noise as cracks. A possible explanation is that the U-Net model uses convolution for local feature modeling and combines surrounding pixel values when making judgments, producing a segmented crack image with reduced noise and enhanced performance.

3.3. Actual Value and Pixel Value Conversion Test

3.3.1. Experimental Equipment and Binocular Camera Calibration

The binocular camera captures the images of the left and right cameras at the same time, and it analyzes the parallax of the corresponding points in the two images to obtain the depth information.
The experimental part of this study adopts a USB binocular camera connected to a computer to capture crack images, which facilitates the subsequent quantization of cracks. Then the coordinates of the pixel points of the crack images in the coordinate system of the camera can be calculated based on the two images captured by the binocular camera.
In this study, considering that crack images must be processed quickly while maintaining accurate recognition during quantification, the capture resolution of the binocular camera is set to 1280 × 480, and the maximum frame rate of the captured video is 30 FPS. To obtain the intrinsic and extrinsic parameters of the camera, this study applies Zhang Zhengyou's checkerboard calibration method for the binocular camera using MATLAB R2016a [55]. A black-and-white checkerboard of specification GP340 is used, with each grid measuring 25 mm × 25 mm. The binocular camera photographs the calibration plate at different positions and angles; a total of 30 calibration images are taken. After eliminating images with large matching errors, the average calibration error is 0.13 pixels, and the calibration process is visualized in Figure 15. The quantification process of the crack detection algorithm proposed in this study requires the video stream captured by the left-eye camera; after calibration of the left-eye camera, its intrinsic matrix parameters fx, fy, cx, and cy are used in the subsequent quantification process. The results are shown in Table 4.
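For readers working in Python rather than MATLAB, an equivalent single-camera (left-eye) calibration step can be sketched with OpenCV as below; the checkerboard inner-corner count, image path pattern, and file format are assumptions, while the 25 mm square size matches the board described above.

import glob
import cv2
import numpy as np

pattern = (9, 6)                                   # inner corners per row/column (assumed)
objp = np.zeros((pattern[0] * pattern[1], 3), np.float32)
objp[:, :2] = np.mgrid[0:pattern[0], 0:pattern[1]].T.reshape(-1, 2) * 25.0   # 25 mm squares

obj_pts, img_pts = [], []
for path in glob.glob("calib/left_*.png"):         # illustrative path pattern
    gray = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
    found, corners = cv2.findChessboardCorners(gray, pattern)
    if found:
        obj_pts.append(objp)
        img_pts.append(corners)

# K contains fx, fy, cx, cy used later in the pixel-to-physical conversion
rms, K, dist, _, _ = cv2.calibrateCamera(obj_pts, img_pts, gray.shape[::-1], None, None)
print("reprojection error (px):", rms)
print("fx, fy, cx, cy:", K[0, 0], K[1, 1], K[0, 2], K[1, 2])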

3.3.2. Evaluation Results of Pixel Distance and Actual Distance Conversion

To verify the effectiveness of the crack pixel-to-actual distance conversion method described in Section 2.4, tests are conducted using the calibrated binocular camera to capture images of concrete wall cracks in four different scenarios. Five measurement points are selected for each crack and the results are recorded; crack width values are thus predicted at 20 points, and the actual crack widths are measured using a crack gauge. The binocular camera calibration results are given in Section 3.3.1. The average depth Z0 of the target points of each crack at the time the image is taken is obtained through binocular image stereo rectification, SGBM stereo matching, and disparity calculation. The pixel width projection values pxi and pyi of each point of each crack in the horizontal and vertical directions are obtained through manual measurement. The predicted width values are calculated by substituting each parameter into Equation (7).
The results of the comparison between the predicted width and the actual measured width are shown in Table 5 and Figure 16. The test results show that the coordinates of the predicted and real crack widths at each point are uniformly distributed on both sides of the y = x function line. The absolute errors are no more than 0.2 pixels, which further proves the effectiveness of the proposed conversion formula.

3.4. Crack Maximum Width Quantitative Test

The maximum crack width values calculated by the edge-line minimum distance method are compared with those obtained by the median-axis plumb line method to evaluate the quantification of the maximum crack width. A total of 20 images of concrete wall cracks collected in the field are used in this experiment. U-Net semantic segmentation is adopted before quantifying the maximum crack width, and the quantification results are compared in terms of pixel width values.
Table 6 lists the absolute errors of the maximum crack width for both methods. As given in Table 6, the maximum absolute error of the maximum crack width based on the edge-line minimum distance method is 6 pixels, with an average absolute error of 4.4 pixels; the maximum absolute error based on the median-axis plumb line method is 2 pixels, with an average absolute error of 0.6 pixels, far below that of the edge-line minimum distance method. Figure 17 shows that the maximum crack widths measured by the median-axis plumb line method lie close to the y = x line representing the true values, whereas most of the maximum crack widths measured by the edge-line minimum distance method fall below the y = x line, indicating that the latter method usually underestimates the true values. The results demonstrate that the median-axis plumb line method proposed in this paper is more accurate in calculating the maximum crack width.

3.5. Integral Crack Quantification Method Test Results

To validate the proposed crack quantification process and evaluate its accuracy, 20 crack images with varying widths are collected for quantification tests on length, maximum width, and crack development angle using the algorithms introduced in Section 2.5, Section 2.6 and Section 2.7. Considering the influence of lighting variations and different material textures on crack segmentation in real-world environments, the test samples include five indoor concrete cracks, five outdoor concrete cracks, five indoor wall cracks, and five outdoor road cracks. The tested crack widths range from 0.1 mm to 10 mm, covering the allowable crack widths in most current design codes for reinforced and pre-stressed concrete structures [7]. The training process and results of the deep learning models are detailed in Section 3.1 and Section 3.2, and the camera calibration process and results are presented in Section 3.3.1. To ensure image clarity and measurement accuracy, the shooting distance is maintained near the camera's focal length; during testing, the binocular camera is fixed 0.5 m away from the target crack to capture images. Detection accuracy can be further improved by applying this method to a specific environment and training the model with crack data from similar settings.
Using the crack length quantification algorithm in this paper, the curve length of the crack can be obtained. However, the curve length of a crack is difficult to measure manually in practice, and the manual measurement error is significant. To make the validation data more reliable, the actual crack length is therefore measured as the length of the straight line connecting the first and last points instead of the curve length of the crack. The original crack curve length quantification algorithm accounts for the influence of the crack direction and of the different pixel-unit scale factors in different directions. Simplifying the same idea, the length of the line between the first and last points of the crack is obtained as shown in Equation (12):
L_i = \frac{Z_0 \, d_i}{f_y \sin\theta}
where Li denotes the predicted value of the length of the line connecting the first and last points of different cracks obtained by the algorithm of this paper. di denotes the projected length of the pixel corresponding to the skeleton line of the crack along the direction of the v-axis of the pixel coordinate system. θ denotes the algorithmic predicted value of the angle of crack development.
To demonstrate the effectiveness of the proposed algorithm through worked examples, Figure 18a–c presents the visualization results of the three finest cracks in the overall quantification test. The relative error percentages of the algorithm-predicted binocular distance, maximum crack width, length between the first and last points, development angle, and other quantification indexes for the cracks in Figure 18a–c are given in Table 7, together with the relative error percentages for all 20 cracks and the averages of these quantification results. The test results indicate that almost all 20 crack samples are completely segmented and that their geometric information is accurately extracted. In a small-scale field quantification test of common engineering cracks, the average relative errors for crack length, width, and orientation angle do not exceed 2%, 4.5%, and 1%, respectively. The proposed automatic quantification algorithm can detect cracks with a maximum width of 0.27 mm at 0.5 m from the target surface, with an absolute error of only 0.05 mm. The method demonstrates good robustness for both millimeter- and centimeter-scale cracks.
However, this study has several limitations. On the one hand, the quantization module proposed in this study only applies to the quantitative assessment of a single crack and cannot achieve effective quantification in complex multi-branched cracks (including Y-shaped cracks, X-shaped cracks, mesh cracks, etc.). On the other hand, the quantization algorithm in this paper cannot quantify the exact length of a crack due to the discontinuity of the mask map caused by the local blockage of the crack. These limitations restrict its applicability in crack-dense regions. Additionally, although the proposed method has significantly improved crack recognition and quantization accuracy, it lacks real-time performance. The recognition speed is relatively slow due to the high computational load imposed by the precise segmentation and complex quantization steps. For an image with an input size of 640 × 480 pixels, the processing speed is approximately 0.5 s per frame, which is suitable for regular monitoring scenarios, such as critical infrastructures like bridges, dams, and tunnels.
The method is expected to be deployed on smart devices equipped with onboard computing systems for large-scale automated crack detection and quantification. This method shows high potential for engineering applications, improving the efficiency and robustness of crack detection while reducing manual intervention. Although the proposed method has significantly improved crack detection and quantification accuracy, its performance in complex multi-crack scenarios requires further study and optimization. Future work will focus on the identification and quantification of multi-crack regions.

4. Conclusions

The proposed computer vision-based framework significantly enhances the accuracy of crack detection and quantification in concrete structures through improved algorithmic techniques. The key findings from this study include the following:
(1)
Enhanced EfficientNetV2 Classification Model: The EfficientNetV2 classification model, known for its low parameter count and fast inference speed, has been improved. By introducing channel attention and spatial attention mechanisms in the MBConv module, the detection accuracy on the crack dataset increases by 1.6% compared to the original EfficientNetV2. In field tests, the model maintains high robustness in detecting fine cracks, outperforming traditional convolutional neural networks overall.
(2)
Pixel-to-Physical Distance Conversion Method: A method for converting pixel distances to actual physical distances is proposed. Using a binocular camera, images of 20 cracks from various scenes are captured, and multiple points are selected for crack width prediction and measurement comparison. Stereo matching and disparity calculations provide the depth information that is substituted into the conversion formula. The predicted results closely match the actual measurements, with absolute errors in crack width prediction of less than 0.2 pixels. The effectiveness of the proposed conversion formula is thus validated, providing a reliable foundation for automated crack width measurement.
(3)
Maximum Crack Width Quantification Algorithm: A skeleton line-based algorithm for quantifying maximum crack width is developed. Based on 20 cracks in field tests, the error in maximum width quantification using the perpendicular line to the central axis is less than 0.6 pixels, compared to an average error of 4.4 pixels using the traditional minimum distance method. The results demonstrate that the accuracy of the proposed quantification method increases by approximately 85% compared to traditional methods.
(4)
Comprehensive Crack Quantification Method: A method considering crack length, maximum width, and propagation angle is developed, achieving high-precision crack quantification. Based on 20 cracks of varying widths, the relative errors between the measured and predicted values are analyzed: the average error in crack length is less than 2%, the average error in maximum width is 4.5%, and the error in crack propagation angle is below 1%. The results show that the proposed quantification method accurately quantifies crack geometric features and is suitable for various complex environments. The method demonstrates excellent robustness and reliability in detecting cracks wider than a millimeter, making it suitable for crack monitoring and health assessment in practical engineering applications.

Author Contributions

Conceptualization, B.W. and B.S.; methodology, B.W.; software, B.W. and S.H.; validation, B.W. and S.H.; investigation, S.D. and B.S.; resources, S.D. and B.S.; data curation, B.W. and S.H.; writing—original draft preparation, S.H.; writing—review and editing, R.L. and B.S.; supervision, L.L. and W.L.; project administration, L.L. and W.L.; funding acquisition, W.L. All authors have read and agreed to the published version of the manuscript.

Funding

This research was supported by the Opening Funds of State Key Laboratory of Building Safety and Built Environment (Grant No. BSBE2023-06) and Core Research on the Fundamental Theory and the Key Technology of Modular-Based Full-Assembly Long-Span Smart Steel Structures (Grant No. 52130809).

Data Availability Statement

The data presented in this study are available upon reasonable request from the corresponding author.

Acknowledgments

The authors sincerely thank Wanying Yuan from Beijing University of Civil Engineering and Architecture for her insightful comments and helpful suggestions during the investigation.

Conflicts of Interest

Authors Liqu Liu, Bo Shen and Shuo Diao were employed by the company China Academy of Building Research Co., Ltd. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

References

  1. Feng, Q.; Kong, Q.; Huo, L.; Song, G. Crack detection and leakage monitoring on reinforced concrete pipe. Smart Mater. Struct. 2015, 24, 115020.
  2. Rodríguez, G.; Casas, J.R.; Villaba, S. Cracking assessment in concrete structures by distributed optical fiber. Smart Mater. Struct. 2015, 24, 035005.
  3. Mohan, A.; Poobal, S. Crack detection using image processing: A critical review and analysis. Alex. Eng. J. 2018, 57, 787–798.
  4. Wiggenhauser, H.; Köpp, C.; Timofeev, J.; Azari, H. Controlled creating of cracks in concrete for non-destructive testing. J. Nondestruct. Eval. 2018, 37, 67.
  5. Zhang, Q.; Xiong, Z. Crack detection of reinforced concrete structures based on BOFDA and FBG sensors. Shock Vib. 2018, 2018, 6563537.
  6. Chen, S.; Feng, Z.; Xiao, G.; Chen, X.; Gao, C.; Zhao, M.; Yu, H. Pavement Crack Detection Based on the Improved Swin-Unet Model. Buildings 2024, 14, 1442.
  7. Flah, M.; Suleiman, A.R.; Nehdi, M.L. Classification and quantification of cracks in concrete structures using deep learning image-based techniques. Cem. Concr. Compos. 2020, 114, 103781.
  8. Wu, J.; He, Y.; Xu, C.; Jia, X.; Huang, Y.; Chen, Q.; Huang, C.; Eslamlou, A.D.; Huang, S. Interpretability Analysis of Convolutional Neural Networks for Crack Detection. Buildings 2023, 13, 3095.
  9. Jang, K.; Kim, N.; An, Y.K. Deep learning–based autonomous concrete crack evaluation through hybrid image scanning. Struct. Health Monit. 2019, 18, 1722–1737.
  10. Coleman, Z.W.; Schindler, A.K. Investigation of ground-penetrating radar, impact echo, and infrared thermography methods to detect defects in concrete bridge decks. Transp. Res. Rec. 2022, 03611981221101027.
  11. Torbaghan, M.E.; Li, W.; Metje, N.; Burrow, M.; Chapman, D.N.; Rogers, C.D. Automated detection of cracks in roads using ground penetrating radar. J. Appl. Geophys. 2020, 179, 104118.
  12. Li, S.; Gu, X.; Xu, X.; Xu, D.; Zhang, T.; Liu, Z.; Dong, Q. Detection of concealed cracks from ground penetrating radar images based on deep learning algorithm. Constr. Build. Mater. 2021, 273, 121949.
  13. Chen, D.; Huang, B.; Kang, F. A review of detection technologies for underwater cracks on concrete dam surfaces. Appl. Sci. 2023, 13, 3564.
  14. Arbaoui, A.; Ouahabi, A.; Jacques, S.; Hamiane, M. Concrete cracks detection and monitoring using deep learning-based multiresolution analysis. Electronics 2021, 10, 1772.
  15. Bhowmick, S.; Nagarajaiah, S.; Veeraraghavan, A. Vision and deep learning-based algorithms to detect and quantify cracks on concrete surfaces from UAV videos. Sensors 2020, 20, 6299.
  16. Dorafshan, S.; Thomas, R.J.; Maguire, M. Comparison of deep convolutional neural networks and edge detectors for image-based crack detection in concrete. Constr. Build. Mater. 2018, 186, 1031–1045.
  17. Song, L.; Sun, H.; Liu, J.; Yu, Z.; Cui, C. Automatic segmentation and quantification of global cracks in concrete structures based on deep learning. Measurement 2022, 199, 111550.
  18. Kim, J.J.; Kim, A.R.; Lee, S.W. Artificial neural network-based automated crack detection and analysis for the inspection of concrete structures. Appl. Sci. 2020, 10, 8105.
  19. Fan, J.; Chen, Y.; Zheng, L. Artificial Intelligence for Routine Heritage Monitoring and Sustainable Planning of the Conservation of Historic Districts: A Case Study on Fujian Earthen Houses (Tulou). Buildings 2024, 14, 1915.
  20. Maceika, A.; Bugajev, A.; Šostak, O.R. Evaluating Modular House Construction Projects: A Delphi Method Enhanced by Conversational AI. Buildings 2024, 14, 1696.
  21. Park, S.E.; Eem, S.H.; Jeon, H. Concrete crack detection and quantification using deep learning and structured light. Constr. Build. Mater. 2020, 252, 119096.
  22. Zhang, J.; Qian, S.; Tan, C. Automated bridge surface crack detection and segmentation using computer vision-based deep learning model. Eng. Appl. Artif. Intell. 2022, 115, 105225.
  23. Yong, Y.P.; Lee, S.J.; Chang, Y.H.; Lee, K.H.; Kwon, S.W.; Cho, C.S.; Chung, S.W. Object detection and distance measurement algorithm for collision avoidance of precast concrete installation during crane lifting process. Buildings 2023, 13, 2551.
  24. Razveeva, I.; Kozhakin, A.; Beskopylny, A.N.; Stel’makh, S.A.; Shcherban’, E.M.; Artamonov, S.; Pembek, A.; Dingrodiya, H. Analysis of Geometric Characteristics of Cracks and Delamination in Aerated Concrete Products Using Convolutional Neural Networks. Buildings 2023, 13, 3014.
  25. Lee, S.Y.; Jeon, J.S.; Le, T.H.M. Feasibility of Automated Black Ice Segmentation in Various Climate Conditions Using Deep Learning. Buildings 2023, 13, 767.
  26. Zou, D.; Zhang, M.; Bai, Z.; Liu, T.; Zhou, A.; Wang, X.; Cui, W.; Zhang, S. Multicategory damage detection and safety assessment of post-earthquake reinforced concrete structures using deep learning. Comput.-Aided Civ. Infrastruct. Eng. 2022, 37, 1188–1204.
  27. Chen, F.C.; Jahanshahi, M.R. NB-CNN: Deep learning-based crack detection using convolutional neural network and Naïve Bayes data fusion. IEEE Trans. Ind. Electron. 2017, 65, 4392–4400.
  28. Iraniparast, M.; Ranjbar, S.; Rahai, M.; Nejad, F.M. Surface concrete cracks detection and segmentation using transfer learning and multi-resolution image processing. Structures 2023, 54, 386–398.
  29. Pauly, L.; Hogg, D.; Fuentes, R.; Peel, H.; Luo, S. Deeper networks for pavement crack detection. In Proceedings of the 34th ISARC, IAARC 2017, Taipei, Taiwan, 28 June–1 July 2017; pp. 479–485.
  30. Han, X.; Zhao, Z.; Chen, L.; Hu, X.; Tian, Y.; Zhai, C.; Wang, L.; Huang, X. Structural damage-causing concrete cracking detection based on a deep-learning method. Constr. Build. Mater. 2022, 337, 127562.
  31. Xu, G.; Liao, C.; Chen, J. Information extraction of apparent cracks in concrete based on HU-ResNet. Comput. Eng. 2020, 46, 279–285.
  32. Habib, M.A.; Hasan, M.J.; Kim, J.M. A lightweight deep learning-based approach for concrete crack characterization using acoustic emission signals. IEEE Access 2021, 9, 104029–104050.
  33. Meng, A.; Zhang, X.; Yu, X.; Jia, L.; Sun, Z.; Guo, L.; Yang, H. Investigation on lightweight identification method for pavement cracks. Constr. Build. Mater. 2024, 447, 138017.
  34. Deng, J.; Lu, Y.; Lee, V.C.S. A hybrid lightweight encoder-decoder network for automatic bridge crack assessment with real-world interference. Measurement 2023, 216, 112892.
  35. Liu, X.; Sun, Y.; Wu, T.; Liu, Y. Flexural cracks in steel fiber-reinforced lightweight aggregate concrete beams reinforced with FRP bars. Compos. Struct. 2020, 253, 112752.
  36. Kim, J.; Shim, S.; Cha, Y.; Cho, G.-C. Lightweight pixel-wise segmentation for efficient concrete crack detection using hierarchical convolutional neural network. Smart Mater. Struct. 2021, 30, 045023.
  37. Hu, B.; Wu, Y.F. Quantification of shear cracking in reinforced concrete beams. Eng. Struct. 2017, 147, 666–678.
  38. Bazrafshan, P.; On, T.; Basereh, S.; Okumus, P.; Ebrahimkhanlou, A. A graph-based method for quantifying crack patterns on reinforced concrete shear walls. Comput.-Aided Civ. Infrastruct. Eng. 2024, 39, 498–517.
  39. Erdem, S.; Hanbay, S.; Blankson, M.A. Self-sensing damage assessment and image-based surface crack quantification of carbon nanofibre reinforced concrete. Constr. Build. Mater. 2017, 134, 520–529.
  40. Ding, W.; Yang, H.; Yu, K.; Shu, J. Crack detection and quantification for concrete structures using UAV and transformer. Autom. Constr. 2023, 152, 104929.
  41. Yan, Y.; Mao, Z.; Wu, J.; Padir, T.; Hajjar, J.F. Towards automated detection and quantification of concrete cracks using integrated images and lidar data from unmanned aerial vehicles. Struct. Control Health Monit. 2021, 28, e2757.
  42. Guo, P.; Meng, W.; Bao, Y. Automatic identification and quantification of dense microcracks in high-performance fiber-reinforced cementitious composites through deep learning-based computer vision. Cem. Concr. Res. 2021, 148, 106532.
  43. Peng, X.; Zhong, X.; Zhao, C.; Chen, A.; Zhang, T. A UAV-based machine vision method for bridge crack recognition and width quantification through hybrid feature learning. Constr. Build. Mater. 2021, 299, 123896.
  44. Ding, W.; Yu, K.; Shu, J. Crack Detection Method for Concrete Structures Based on Deep Learning and UAV. China Civ. Eng. J. 2021, 152, 104929.
  45. Yang, N.; Zhang, C.; Li, T. Design of Crack Monitoring System for Wooden Structures of Ancient Chinese Buildings Based on Unmanned Aerial Vehicle and Computer Vision. Eng. Mech. 2021, 38, 13.
  46. Hirschmuller, H. Stereo processing by semiglobal matching and mutual information. IEEE Trans. Pattern Anal. Mach. Intell. 2008, 30, 328–341.
  47. Hoang, N.D. Detection of surface crack in building structures using image processing technique with an improved Otsu method for image thresholding. Adv. Civ. Eng. 2018, 2018, 3924120.
  48. Meng, S.; Gao, Z.; Zhou, Y.; He, B.; Djerrad, A. Real-time automatic crack detection method based on drone. Comput.-Aided Civ. Infrastruct. Eng. 2023, 38, 849–872.
  49. Tan, M.; Le, Q. EfficientNetV2: Smaller Models and Faster Training. In Proceedings of the International Conference on Machine Learning, PMLR 139, Virtual, 18–24 July 2021.
  50. Tan, M.; Le, Q. EfficientNet: Rethinking model scaling for convolutional neural networks. In Proceedings of the International Conference on Machine Learning, PMLR, Long Beach, CA, USA, 9–15 June 2019; pp. 6105–6114.
  51. Zhang, Z.H.; Wen, Y.N.; Mu, H.W.; Du, X.P. Dual attention mechanism based pavement crack detection. J. Image Graph. 2022, 27, 2240–2250.
  52. Rosenfeld, A.; Pfaltz, J.L. Distance functions on digital pictures. Pattern Recognit. 1968, 1, 33–61.
  53. Özgenel, Ç.F. Concrete crack images for classification. Mendeley Data 2019, 2.
  54. Glorot, X.; Bengio, Y. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, JMLR Workshop and Conference Proceedings, Chia Laguna Resort, Sardinia, Italy, 13–15 May 2010; pp. 249–256.
  55. Zhang, Z. Flexible camera calibration by viewing a plane from unknown orientations. In Proceedings of the Seventh IEEE International Conference on Computer Vision, Kerkyra, Greece, 20–27 September 1999; Volume 1, pp. 666–673.
Figure 2. EfficientNetV2 basic module architecture: (a) MBConv structure; (b) Fused-MBConv structure.
Figure 3. SE module structure.
Figure 4. Improvements to the original modules: (a) CAM channel attention mechanism module; (b) SAM spatial attention mechanism module; (c) improved MBConv module; (d) improved Fused-MBConv module.
Figure 5. U-Net segmentation model architecture.
Figure 6. Actual value to pixel value conversion process: (a) schematic diagram of the binocular vision crack detection system; (b) principle diagram of the binocular stereo vision system; (c) diagram of coordinate transformation.
Figure 7. Binarization of crack images.
Figure 8. Schematic of distance point values.
Figure 9. Methods for quantifying maximum crack width: (a) crack edge points; (b) crack width calculation based on minimum distance from the edge line; (c) cracks with abrupt changes in width; (d) comparison of crack width quantification based on minimum distance from the edge line with quantification based on the perpendicular to the central axis.
Figure 10. Traversing points on a skeleton line vs. traversing points on an edge line.
Figure 11. The crack angle quantification method used in this study.
Figure 12. Comparison of crack image recognition: (a) the result of the original EfficientNetv2 model for predicting normal cracks; (b) the result of the improved EfficientNetv2 model for predicting normal cracks; (c) the result of the original EfficientNetv2 model for predicting fine cracks; (d) the result of the improved EfficientNetv2 model for predicting fine cracks.
Figure 13. U-Net segmentation model training results: (a) U-Net segmentation model training loss; (b) U-Net segmentation model MIoU.
Figure 14. Comparison of the effectiveness of the Otsu method and U-Net for crack segmentation.
Figure 15. Binocular camera calibration: (a) mean error in pixels; (b) binocular camera and calibration plate attitude visualization.
Figure 16. The absolute error of conversion of width pixel values to actual values: (a) results of measured and predicted widths at crack locations; (b) absolute error of measured and predicted widths at crack locations.
Figure 17. Crack maximum width measurements with absolute error: (a) crack maximum width prediction results of the traditional method and the method in this paper; (b) errors in crack maximum width prediction between the traditional method and the method in this paper.
Figure 18. Visualization of the overall crack quantification process: (a) example 1; (b) example 2; (c) example 3.
Table 1. EfficientNetv2 network architecture [49].

Stage | Operator                | Stride | Channels | Layers
0     | Conv3×3                 | 2      | 24       | 1
1     | Fused-MBConv1, k3×3     | 1      | 24       | 2
2     | Fused-MBConv4, k3×3     | 2      | 48       | 4
3     | Fused-MBConv4, k3×3     | 2      | 64       | 4
4     | MBConv4, k3×3, SE0.25   | 2      | 128      | 6
5     | MBConv6, k3×3, SE0.25   | 1      | 160      | 9
6     | MBConv6, k3×3, SE0.25   | 2      | 256      | 15
7     | Conv1×1 & Pooling & FC  | –      | 1280     | 1
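
As a companion to Table 1 and the improved modules illustrated in Figure 4, the sketch below shows one common (CBAM-style) way to realize a channel attention module (CAM) and a spatial attention module (SAM) that can be attached to MBConv. The exact internals used in the paper are not reproduced here, so this PyTorch code is an assumption-laden illustration only.

    import torch
    import torch.nn as nn


    class ChannelAttention(nn.Module):
        """CAM: re-weights channels using globally pooled descriptors (CBAM-style, assumed)."""
        def __init__(self, channels: int, reduction: int = 16):
            super().__init__()
            hidden = max(channels // reduction, 1)
            self.mlp = nn.Sequential(
                nn.Conv2d(channels, hidden, kernel_size=1, bias=False),
                nn.ReLU(inplace=True),
                nn.Conv2d(hidden, channels, kernel_size=1, bias=False),
            )

        def forward(self, x):
            avg = self.mlp(torch.mean(x, dim=(2, 3), keepdim=True))  # average-pooling branch
            mx = self.mlp(torch.amax(x, dim=(2, 3), keepdim=True))   # max-pooling branch
            return x * torch.sigmoid(avg + mx)


    class SpatialAttention(nn.Module):
        """SAM: re-weights spatial positions using channel-wise mean/max maps (assumed)."""
        def __init__(self, kernel_size: int = 7):
            super().__init__()
            self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2, bias=False)

        def forward(self, x):
            avg = torch.mean(x, dim=1, keepdim=True)
            mx, _ = torch.max(x, dim=1, keepdim=True)
            return x * torch.sigmoid(self.conv(torch.cat([avg, mx], dim=1)))

In an improved MBConv block, the two modules would typically be applied in sequence to the block's feature map before the residual connection, e.g. feat = SpatialAttention()(ChannelAttention(c)(feat)).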
Table 2. Effect of different hyperparameters on prediction accuracy.

Parameter Name        | Parameter Setting | Accuracy (%)
Batch size            | 1                 | 94.80
Batch size            | 16                | 96.85
Batch size            | 32                | 96.95
Static learning rate  | 0.01              | 95.74
Dynamic learning rate | Cosine function   | 96.85
Optimizer             | SGD               | 96.85
Optimizer             | Adam              | 87.60
Table 3. Improved classification model ablation experiment results.

Network Model           | Predictive Accuracy (Top-5) | Prediction Time (ms)
EfficientNetv2          | 95.3%                       | 15.4
Improved EfficientNetv2 | 96.9%                       | 15.6
Table 4. Binocular camera calibration results.

Camera Parameters                 | Left-Eye Camera                                     | Right-Eye Camera
Internal reference matrix (pixel) | [519.133 0.154 319.216; 0 521.096 253.060; 0 0 1]   | [517.718 0.429 314.897; 0 519.805 248.878; 0 0 1]
Aberration parameter vector       | (0.056, 0.238, 0.001, 0.001, 0.282)                 | (0.057, 0.249, 0.002, 0.002, 0.282)
Common focal length f (mm)        | 528.089
Baseline distance b (mm)          | 119.802
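
The calibrated common focal length and baseline in Table 4 feed directly into the standard rectified-stereo relations that convert disparity to depth and pixel spans to physical size. The snippet below is a minimal sketch of those textbook relations; it treats f ≈ 528 as being expressed in pixel units, and it is not necessarily the exact conversion formula used in the paper.

    def disparity_to_depth(disparity_px: float,
                           focal_px: float = 528.089,
                           baseline_mm: float = 119.802) -> float:
        # Z = f * b / d for a rectified stereo pair (f in pixels, b in mm, d in pixels).
        return focal_px * baseline_mm / disparity_px


    def pixel_span_to_mm(span_px: float, depth_mm: float,
                         focal_px: float = 528.089) -> float:
        # Similar triangles: a span of p pixels at depth Z covers p * Z / f millimeters.
        return span_px * depth_mm / focal_px

For example, a disparity of about 120 pixels would correspond to a depth of roughly 527 mm, and a 10-pixel crack width at that depth to roughly 10 mm.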
Table 5. Measurement results and errors of converting pixel widths to actual widths at crack locations.

Crack Image (Image Depth Z0, mm) | No. | Pixel Width di (pixel) | pxi (pixel) | pyi (pixel) | Predicted Value by this Algorithm Di (mm) | Actual Measured Value Dgti (mm) | Width Measurement Error ΔDi (mm)
(a) 127.38 (Z0(a) = 0.71)        | 1   | 11.38 | 7  | 9  | 0.60 | 0.50 | 0.10
                                 | 2   | 9.48  | 3  | 9  | 0.50 | 0.45 | 0.05
                                 | 3   | 11.56 | 9  | 7  | 0.61 | 0.55 | 0.06
                                 | 4   | 34.50 | 22 | 27 | 1.82 | 1.80 | 0.02
                                 | 5   | 13.08 | 8  | 10 | 0.69 | 0.65 | 0.04
(b) 674.96 (Z0(b) = 0.69)        | 6   | 3.60  | 3  | 2  | 0.52 | 0.60 | 0.08
                                 | 7   | 3.95  | 2  | 3  | 0.57 | 0.60 | 0.03
                                 | 8   | 4.78  | 2  | 4  | 0.69 | 0.65 | 0.04
                                 | 9   | 5.96  | 4  | 4  | 0.86 | 0.70 | 0.16
                                 | 10  | 6.23  | 6  | 2  | 0.90 | 0.75 | 0.15
(c) 1125.51 (Z0(c) = 0.84)       | 11  | 13.02 | 11 | 7  | 0.64 | 0.65 | 0.01
                                 | 12  | 14.86 | 14 | 5  | 0.73 | 0.70 | 0.03
                                 | 13  | 21.57 | 20 | 8  | 1.06 | 1.10 | 0.04
                                 | 14  | 15.47 | 14 | 7  | 0.76 | 0.65 | 0.11
                                 | 15  | 26.46 | 18 | 19 | 1.30 | 1.50 | 0.20
(d) 1637.88 (Z0(d) = 0.51)       | 16  | 11.65 | 10 | 6  | 0.85 | 0.80 | 0.05
                                 | 17  | 15.08 | 10 | 11 | 1.10 | 1.20 | 0.10
                                 | 18  | 16.45 | 11 | 12 | 1.20 | 1.00 | 0.20
                                 | 19  | 15.08 | 10 | 11 | 1.10 | 1.25 | 0.15
                                 | 20  | 10.96 | 7  | 8  | 0.80 | 1.00 | 0.20
Table 6. Crack maximum width pixel value measurement results and errors.

No. | Actual Maximum Width Dgti (pixel) | Method in this Paper: Predicted Di (pixel) | Method in this Paper: Error ΔDi (pixel) | Traditional Method: Predicted Di (pixel) | Traditional Method: Error ΔDi (pixel)
1   | 36 | 36 | 0 | 32 | 4
2   | 26 | 26 | 0 | 24 | 2
3   | 68 | 68 | 0 | 62 | 6
4   | 96 | 96 | 0 | 90 | 6
5   | 26 | 27 | 1 | 21 | 5
6   | 47 | 48 | 1 | 41 | 6
7   | 20 | 22 | 2 | 19 | 1
8   | 26 | 26 | 0 | 23 | 3
9   | 22 | 22 | 0 | 18 | 4
10  | 29 | 30 | 1 | 23 | 6
11  | 58 | 58 | 0 | 52 | 6
12  | 20 | 22 | 2 | 19 | 1
13  | 33 | 34 | 1 | 28 | 5
14  | 74 | 73 | 1 | 68 | 6
15  | 85 | 84 | 1 | 79 | 6
16  | 56 | 55 | 1 | 50 | 6
17  | 64 | 64 | 0 | 58 | 6
18  | 81 | 82 | 1 | 75 | 6
19  | 70 | 70 | 0 | 65 | 5
20  | 65 | 66 | 1 | 62 | 4
Table 7. Overall crack quantification results: predicted vs. measured values.

Image      | Method       | Image Depth Z0 (mm) | Depth Error δZi | Predicted Length Li (mm) | Length Error δLi | Predicted Maximum Width Di (mm) | Maximum Width Error δDi | Predicted Angle θi (°) | Angular Error δθi
Figure 18a | Algorithm    | 49.28 | 0.42% | 47.36 | 4.21% | 0.32 | 18.52% | 65.04 | 1.56%
Figure 18a | Experimental | 50.00 | –     | 47.56 | –     | 0.27 | –      | 64.05 | –
Figure 18b | Algorithm    | 49.81 | 1.4%  | 43.49 | 2.30% | 0.61 | 10.91% | 89.36 | 0.16%
Figure 18b | Experimental | 50.00 | –     | 43.39 | –     | 0.55 | –      | 89.50 | –
Figure 18c | Algorithm    | 50.21 | 0.38% | 59.41 | 3.13% | 1.30 | 6.56%  | 60.01 | 1.27%
Figure 18c | Experimental | 50.00 | –     | 59.60 | –     | 1.22 | –      | 59.25 | –
Global Average | 20 images used for testing | – | ~1.0% | – | ~2.0% | – | ~4.50% | – | ~1.0%
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
