1. Introduction
Over the past five decades, Hong Kong has emerged as one of the most economically advanced regions, owing to its unique cultural heritage and strategic location. This remarkable progress has been accompanied by significant improvements in urban infrastructure and facilities, including the construction of bridges and roads [
1]. While urban development has undeniably brought convenience to the lives of citizens, it has also given rise to a pressing issue—the maintenance of bridges. In our daily lives, numerous factors, such as cyclic loading, fatigue stresses, and adverse long-term environmental conditions, have hastened the deterioration of bridge surfaces, resulting in problems like cracks, ruts, and potholes [
2]. Among these concerns, cracks hold particular significance due to their direct impact on concrete structures’ safety, functionality, and durability [
3]. Cracks provide an entry point for corrosive chemicals within the concrete, allowing water and de-icing salts to penetrate bridge decks, potentially causing damage to superstructures and aesthetic elements [
4]. Since cracks are a critical indicator of a structure’s condition, they play a pivotal role in structural health monitoring [
5]. Consequently, conducting a rigorous study to assess the extent and severity of cracks is essential for evaluating the condition of bridges and maintaining a comprehensive database for long-term bridge inspections and analysis. As a result, numerous countries have established comprehensive maintenance plans for their bridges. For example, the United States mandates biennial bridge inspections in compliance with American Association of State Highway and Transportation Officials (AASHTO) requirements [
6]. In the United Kingdom, bridge inspections occur every one to three years in accordance with established standards [
7]. In China, specific standards dictate the detection and quantification of bridge cracks wider than 0.2 mm [
8].
In practical terms, due to its simplicity, human visual inspection remains the predominant method for monitoring bridge health [
9]. However, this method relies heavily on the inspector's expertise and experience and is inherently subjective, which can result in imprecise and unreliable assessment outcomes [
10,
11]. Furthermore, visual inspection is often impractical for inaccessible areas of a bridge, such as columns and intersections [
12]. In such cases, bridge inspectors resort to deploying inspection vehicles as manual operation platforms to measure the size of bridge cracks. Nevertheless, this approach has its own set of limitations, including traffic disruptions, time consumption, and high maintenance costs [
13]. The advent of computer vision models has paved the way for automatic inspection technologies, offering enhanced efficiency and precision in contrast to manual measurements, which are susceptible to inaccuracies and subjectivity [
14,
15].
However, the application of these technologies presents various challenges and limitations. Firstly, many studies in this field focus on crack detection under ideal conditions, often on surfaces consisting entirely of concrete or asphalt. In real-world scenarios, crack detection tasks frequently entail complex backgrounds with diverse elements, such as tree leaves and varying lighting conditions, rendering the accurate distinction of cracks challenging [
16]. Furthermore, traditional bridge distress inspection commonly involves image processing techniques employing edge detection and image thresholding. Cracks are identified based on changes in edge gradients derived from the intensity difference relative to the background and are extracted through threshold segmentation [
17]. However, this method is significantly affected by environmental factors during image collection, including variations in lighting conditions and the presence of oil stains [
3]. The current study introduces an integrated framework tailored explicitly for the segmentation of road cracks and the quantification of surface features, with a particular emphasis on addressing the challenges posed by complex backgrounds. In contrast to existing research in the same domain, our study presents several noteworthy contributions:
(1) We provide a comprehensive literature review that delves into the realms of bridge crack detection and segmentation, offering a thorough understanding of the existing body of knowledge in this field.
(2) To enrich the research landscape, we have meticulously curated a new bridge crack dataset customized to the distinctive context of Hong Kong. This dataset encompasses a diverse array of crack samples, faithfully representing the complexities encountered in real-world scenarios, including various crack sizes and intricate crack backgrounds.
(3) Our study pioneers the development of a novel YOLOv8-AFPN-MPDIoU model, an advanced approach facilitating precise and timely detection and segmentation of cracks within Hong Kong's bridges with complex backgrounds, thus enhancing process efficiency. The AFPN is incorporated into the YOLOv8 neck, effectively reducing information gaps between non-adjacent layers and facilitating improved feature fusion. The MPD-IoU (minimum point distance intersection over union) loss function is deployed to solve the problem that arises when the predicted bounding box possesses the same aspect ratio as the ground-truth bounding box but differs in width and height. In terms of recognition precision, model size, and inference time, the proposed model emerges as the optimal choice for deployment on portable devices in practical applications.
(4) Furthermore, we employ a distance transform method (DTM) to accurately measure the length and width of segmented road cracks at the pixel level. This approach enhances precision and minimizes the impact of varying lighting conditions, a crucial aspect of our methodology.
2. Literature Review
Crack identification is mainly divided into traditional methods based on digital image processing and methods based on deep learning. Digital image processing methods based on edge detection and threshold segmentation are highly susceptible to ambient environmental conditions and image quality requirements. Additionally, these techniques may fail when encountered with real-world scenarios such as tree leaves and varying lighting conditions [
18]. Likewise, they remain highly dependent on the design of feature descriptors and are therefore difficult to generalize [
19,
20,
21]. In this respect, some previous research attempts relied on the use of threshold segmentation [
22], Otsu [
23], improved Otsu [
24,
25,
26], edge detectors [
27,
28], improved watersheds [
29], artificial neural networks [
30,
31], hybrid artificial neural networks [
32], and support vector machines [
33,
34,
35] for crack detection and assessment.
Deep learning addresses the limitations inherent in the typical machine learning models mentioned above, aside from their shared dependence on data. The advancement of graphical processing units (GPUs) has significantly accelerated image processing on computers. Consequently, deep learning techniques can now be effectively employed for tasks such as image object detection and segmentation, including the management of building construction [
36]. Convolutional neural networks (CNNs) are among the popular techniques used to extract category and location information from images in object detection tasks. However, the process of generating region proposals using selective search remains slow and exhibits limited accuracy for two-stage detectors [
37]. Rosso et al. [
38] presented a ResNet-50-based framework for defect classification in ground penetrating radar profiles. The authors implemented bi-dimensional Fourier transform as a preprocessing convolution operation, and vision transformer architecture was incorporated, resulting in better localization of defects. In another research effort, Park et al. [
39] introduced a machine-learning-based model for predicting the depths of concrete cracks in thermal images. The four machine learning models comprised AdaBoost, gradient boosting, random forest, and multilayer perceptron. The AdaBoost model demonstrated the highest prediction accuracy, with a determination coefficient of 98.96% and a mean absolute percentage error of 0.6%. They also showed that the combination of principal component analysis with AdaBoost or gradient boosting could properly detect microcracks.
Among the bounding box detection models, Zhang et al. [
40] created a single-stage detector based on you only look once (YOLOv3) to locate bridge surface defects like cracks, spalling, pop-outs, and exposed rebar. In another study, Deng et al. [
41] utilized YOLOv2 to locate cracks in real-world images contaminated with handwriting scripts. In a third study, Teng et al. [
42] investigated the use of 11 pre-trained deep learning networks as feature extractors for YOLOv2, including AlexNet, VGG16, VGG19, ResNet50, ResNet101, and GoogLeNet. They concluded that the integration of ResNet18 and YOLOv2 was the most appropriate in terms of computational accuracy and efficiency. Similarly, Qiu and Lau [
43] elucidated that ResNet50-based YOLOv2 and YOLOv4-tiny demonstrate exceptional accuracy, fast processing speed, and notable proficiency in detecting small cracks in tiled sidewalks. Xiang et al. [
44] addressed various challenges encountered by UAV-based crack detection networks, including low efficiency and accuracy stemming from road noise such as shadows, occlusions, and low contrast. Consequently, the YOLO architecture is recognized for its capability to identify cracks amidst complex backgrounds.
However, the bounding box outcomes provided by crack recognition lack the potential for further quantitative analysis, thereby rendering crack recognition less suitable as a crack measurement algorithm [
11]. The task of object segmentation evolves from object detection, requiring a comprehensive understanding of the image scene at the pixel level. In crack segmentation tasks, the input image typically contains only cracks and background. The resulting output consists of binary values of 1 and 0, representing crack pixels and background pixels, respectively [
45]. As a result, the outcomes of crack segmentation can facilitate subsequent geometric calculations of crack length and width. The human brain rapidly processes the distinct shapes and irregular contours of objects, much as an object segmentation task does. Consequently, creating a detection frame outside the target object is deemed unnecessary and incongruent with human visual interpretation [
46]. Therefore, the utilization of the YOLOv8-seg model in bridge inspection endeavors represents a promising initiative.
3. Research Methodology
This research study comprises three distinct phases within its framework: a systematic review, data collection, and crack measurement. Illustrated in
Figure 1, this framework delineates a structured approach to bridge crack segmentation and measurement. The developed model is grounded in a conceptualization revolving around three essential modules. The initial module involves a comprehensive examination of existing models for detecting and quantifying surface cracks on bridge decks. This critical review not only identified research gaps, future research directions, and clear objectives but also prompted the utilization of a publicly available database to augment the diversity and complexity of image scenarios used in experimental settings. However, the publicly available images alone cannot fully capture the complex diversity of real-world scenes. Therefore, a set of supplementary images featuring various bridge defects was curated specifically for this purpose, utilizing data augmentation, an extensively employed method to mitigate overfitting in deep learning models while enhancing their diversity and generalizability [
47]. The images were captured at various locations across Hong Kong, encompassing a diverse array of environmental factors. These include variations in sunlight conditions, diverse surface textures, different bridge locations, instances of graffiti, and occurrences of water damage. Various transformations, such as lighting adjustments, random element addition, noise introduction, flipping, and image bounding box manipulation, were applied to augment the Hong Kong dataset. At this juncture, the integrated dataset combines the publicly accessible benchmark dataset of bridge surface cracks (BSD) sourced from [
48] alongside the Hong Kong bridge crack dataset (HKBCD). The third module of this study focuses on developing a novel YOLOv8-AFPN-MPDIoU model for surface crack segmentation, complemented by the utilization of the distance transform method (DTM) for precise calculation of bridge crack width and length.
4. Model Development
This section delineates the basic components of the developed YOLOv8-AFPN-MPDIoU model.
4.1. Basics of YOLOv8-Segmentation
The architecture of the developed YOLOv8-AFPN-MPDIoU model is depicted in
Figure 2. Its distinctive features lie in the following:
The YOLOv8 network offers support for object detection, tracking, and various additional tasks, including instance segmentation, image classification, and key-point detection. Similar to YOLOv5, YOLOv8 presents five distinct scales of models (n, s, m, l, x) with increasing depth and width from left to right [
49].
In alignment with the efficient layer aggregation network (ELAN) design philosophy, YOLOv8 replaces the C3 structure within the YOLOv5 backbone network with a C2f structure. This modification enables YOLOv8 to retain its lightweight characteristics while enhancing the flow of gradients [
50]. In comparison to YOLOv5, YOLOv8 demonstrates more pronounced disparities in its head section due to the integration of the widely adopted decoupled head structure.
YOLOv8 adopts the Task-Aligned-Assigner strategy for positive sample assignment in loss function calculation [
51]. Additionally, it introduces the distribution focal loss (DFL). During training, the strategy of disabling mosaic augmentation in the final ten epochs is incorporated, as inspired by YOLOX, to effectively enhance precision in the data augmentation process.
YOLOv8s-Seg is an extension of the YOLOv8 object detection model specifically tailored for performing segmentation tasks. This network draws upon the principles of the YOLACT network to achieve real-time instance segmentation of objects while maintaining a high segment mean average precision [
52]. The structural overview of the YOLACT network is presented in
Figure 3.
The YOLOv8-Seg network, version ultralytics 8.0.201, comprises three primary components: the backbone, the neck, and the head. In the associated GitHub repository, five distinct scale models of the network are available, specifically YOLOv8n-Seg, YOLOv8s-Seg, YOLOv8m-Seg, YOLOv8l-Seg, and YOLOv8x-Seg. In this study, experiments were conducted using YOLOv8-Seg models at various scales to assess the segment mAP50 and model size. Given the minimal presence of cracks in each image, the first four scales of models were utilized to identify the most suitable scale.
Figure 2.
Architecture of the developed YOLOv8-AFPN-MPDIoU model.
4.2. Asymptotic Feature Pyramid Network
In tasks related to object detection and segmentation, multi-scale feature extraction plays a key role in object encoding and the handling of scale changes. A common approach is to utilize well-established top-down or bottom-up feature pyramid networks. However, these methods are susceptible to the loss and deterioration of feature information, which adversely affects the integration of information between non-adjacent layers. Within prevailing feature pyramid architectures, the high-level features at the pinnacle of the pyramid must traverse multiple intermediate scales and interact with features at those intermediate levels before finally fusing with the low-level features at the base. Throughout this propagation and interaction, the semantic information contained within the high-level features risks being lost or compromised. Conversely, the bottom-up pathway of the PAFPN (path aggregation feature pyramid network) introduces the converse challenge: the detailed information originating from the lower-level features may suffer loss or deterioration during propagation and interaction.
4.2.1. Asymptotic Architecture
To address this challenge, the asymptotic feature pyramid network (AFPN) was developed by Yang et al. [
53]. The issue stems from the substantial semantic gap between non-adjacent hierarchical features, which is especially pronounced between the lowest- and highest-level features. This gap hinders the effective fusion of non-adjacent hierarchical features. However, the AFPN architecture is designed to be asymptotic, resulting in a closer alignment of semantic information across different hierarchical feature levels during the asymptotic fusion process. This, in turn, mitigates the aforementioned issues. As illustrated in
Figure 4, during the bottom-up feature extraction process of the backbone network, the AFPN gradually integrates low-level features, followed by intermediate-level features, before culminating in the fusion with the highest-level feature, which is the most abstract. Black arrows denote convolutions, while blue arrows indicate adaptive spatial fusion. In this proposed model, the AFPN is incorporated into the YOLOv8 neck, effectively reducing information gaps between non-adjacent layers and facilitating improved feature fusion.
4.2.2. Adaptive Spatial Fusion
The ASFF technique is utilized to assign variable spatial weights to different feature levels during multi-level feature fusion, enhancing the importance of critical levels while minimizing the impact of conflicting information from diverse objects. Illustrated in
Figure 5, features from three different levels are combined. The figure illustrates feature fusion at three levels, but the method can be adapted for cases with more or fewer levels. Representing the feature vector transitioning from level $n$ to level $l$ at position $(i, j)$ as $x_{ij}^{n \to l}$, the resultant feature vector $y_{ij}^{l}$ is attained through adaptive spatial fusion, formally defined as the linear combination of the feature vectors $x_{ij}^{1 \to l}$, $x_{ij}^{2 \to l}$, and $x_{ij}^{3 \to l}$ (refer to Equation (1)):

$$y_{ij}^{l} = \alpha_{ij}^{l} \cdot x_{ij}^{1 \to l} + \beta_{ij}^{l} \cdot x_{ij}^{2 \to l} + \gamma_{ij}^{l} \cdot x_{ij}^{3 \to l} \quad (1)$$

At level $l$, $\alpha_{ij}^{l}$, $\beta_{ij}^{l}$, and $\gamma_{ij}^{l}$ represent the spatial weights of the features across the three levels, subject to the constraint $\alpha_{ij}^{l} + \beta_{ij}^{l} + \gamma_{ij}^{l} = 1$. Due to variations in the number of fused features at each stage of the AFPN, stage-specific adaptive spatial fusion modules are introduced [54]. The adaptive spatial fusion operation is shown in
Figure 5.
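To make the fusion rule concrete, the following Python sketch applies softmax-normalized spatial weights to three same-resolution feature maps, mirroring the linear combination of Equation (1). It is an illustrative assumption, not the paper's implementation: in the actual AFPN, the weights are produced by learned 1×1 convolutions, and the inputs have already been rescaled to the target level.

```python
import numpy as np

def adaptive_spatial_fusion(features, weight_logits):
    """Fuse same-shaped feature maps with per-position spatial weights.

    features:      list of L arrays, each (C, H, W) -- features already
                   rescaled to the target level l.
    weight_logits: array (L, H, W) -- unnormalized weight maps; a softmax
                   across the level axis enforces that the weights sum to
                   1 at every spatial position, as required by Equation (1).
    """
    logits = np.asarray(weight_logits, dtype=float)
    # Softmax over the level axis -> alpha, beta, gamma sum to 1 per pixel.
    exp = np.exp(logits - logits.max(axis=0, keepdims=True))
    weights = exp / exp.sum(axis=0, keepdims=True)        # (L, H, W)
    # Linear combination of Equation (1), broadcast over channels.
    fused = sum(w[None, :, :] * f for w, f in zip(weights, features))
    return fused, weights

# Three toy feature maps at the same resolution (C=2, H=W=4).
rng = np.random.default_rng(0)
feats = [rng.standard_normal((2, 4, 4)) for _ in range(3)]
fused, w = adaptive_spatial_fusion(feats, rng.standard_normal((3, 4, 4)))
```

Because the weights are softmax-normalized across the level axis, the constraint that they sum to one at every spatial position holds by construction.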
4.3. Minimum Point Distance-IoU Loss Function
The optimization process of the YOLOv8 model occurs across two dimensions: classification and regression. While the classification loss continues to employ the binary cross-entropy loss (BCEL), the regression component integrates the distribution focal loss (DFL) and the bounding box regression loss (BBRL). Equation (2) comprehensively represents the loss function:

$$L = \lambda_{1} L_{BCEL} + \lambda_{2} L_{DFL} + \lambda_{3} L_{BBRL} \quad (2)$$

Here, $L$ denotes the total loss; $\lambda_{1}$, $\lambda_{2}$, and $\lambda_{3}$ represent the weighting factors assigned to each loss term; and $L_{BCEL}$, $L_{DFL}$, and $L_{BBRL}$ are the individual loss functions for binary cross-entropy, distribution focal loss, and bounding box regression, respectively. The BCEL is utilized for classification, assessing the dissimilarity between predicted class probabilities and ground-truth labels. The DFL is applied for regression, considering the distribution of predicted bounding box values and focusing on challenging samples. Meanwhile, the BBRL aims to minimize the difference between predicted and ground-truth bounding box coordinates.
Bounding box regression (BBR) has established itself as a crucial component in object detection and instance segmentation, serving as a pivotal step in precise object localization. Nevertheless, a substantial limitation of most prevailing loss functions for bounding box regression lies in their inability to be effectively optimized when the predicted bounding box possesses the same aspect ratio as the ground-truth bounding box but exhibits distinct width and height values. To address these aforementioned challenges, we conducted a comprehensive exploration of the geometric attributes associated with horizontal rectangles. Consequently, we introduce a novel bounding box similarity comparison metric, denoted as MPD-IoU (minimum point distance intersection over union), which encapsulates all the relevant factors considered in existing loss functions. This includes considerations such as the overlapping or non-overlapping area, the distance between central points, and variations in width and height, while concurrently simplifying the computational process.
In contrast to the complete intersection over union (C-IoU) loss function employed in YOLOv8, the minimum point distance-IoU (MPD-IoU) loss function is employed and compared against other IoU-based loss functions [55].
Figure 6 demonstrates a visual presentation of the proposed MPD-IoU loss function.
Figure 7 provides visual examples of predicted bounding boxes and ground-truth bounding boxes that share the same aspect ratio but differ in width and height by a factor k, where k > 1 and k ∈ ℝ. In these visualizations, the yellow box signifies the ground-truth bounding box, while the red and black boxes represent the predicted bounding boxes. Mathematically, the key components of the MPD-IoU metric are defined using Equations (3)–(8).
where the overall loss function, denoted as $L$, is assigned weights ($\lambda_{1}$, $\lambda_{2}$, $\lambda_{3}$) to balance the contributions of the binary cross-entropy loss ($L_{BCEL}$), the distribution focal loss ($L_{DFL}$), and the proposed MPD-IoU-based loss ($L_{MPDIoU}$). Here, ($x_{c}^{gt}$, $y_{c}^{gt}$) and ($x_{c}^{prd}$, $y_{c}^{prd}$) denote the coordinates of the central points of the ground-truth bounding box and the predicted bounding box, respectively. Similarly, $w_{gt}$ and $h_{gt}$ represent the width and height of the ground-truth bounding box, while $w_{prd}$ and $h_{prd}$ denote the width and height of the predicted bounding box.
The expressiveness of our metric is underscored by the fact that all the factors considered in existing loss functions can be determined from the coordinates of the top-left and bottom-right points, encompassing the non-overlapping area, central point distance, and deviations in width and height. As a result, our proposed metric not only offers a comprehensive consideration of these factors but also streamlines the calculation process. It is worth noting that when the aspect ratio of the predicted bounding box matches that of the ground-truth bounding box, the metric yields a lower loss value when the predicted bounding box is contained within the ground-truth bounding box than when it extends beyond the boundaries of the ground-truth bounding box. This unique characteristic enhances the precision of bounding box regression, ultimately reducing redundancy in the predicted bounding boxes.
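A minimal Python sketch of the MPD-IoU loss for axis-aligned boxes follows, assuming the published formulation in which the corner-distance terms are normalized by the squared dimensions of the input image; the function name and box format (x1, y1, x2, y2) are illustrative choices, not the paper's code.

```python
def mpdiou_loss(pred, gt, img_w, img_h):
    """Sketch of the MPD-IoU loss for axis-aligned boxes (x1, y1, x2, y2).

    d1 and d2 are the squared distances between the top-left and
    bottom-right corners of the predicted and ground-truth boxes,
    normalized by the squared image diagonal (w^2 + h^2).
    """
    px1, py1, px2, py2 = pred
    gx1, gy1, gx2, gy2 = gt

    # Plain IoU from the overlapping area.
    ix1, iy1 = max(px1, gx1), max(py1, gy1)
    ix2, iy2 = min(px2, gx2), min(py2, gy2)
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((px2 - px1) * (py2 - py1)
             + (gx2 - gx1) * (gy2 - gy1) - inter)
    iou = inter / union if union > 0 else 0.0

    # Minimum point distance terms (top-left and bottom-right corners).
    diag2 = img_w ** 2 + img_h ** 2
    d1 = ((px1 - gx1) ** 2 + (py1 - gy1) ** 2) / diag2
    d2 = ((px2 - gx2) ** 2 + (py2 - gy2) ** 2) / diag2

    mpdiou = iou - d1 - d2
    return 1.0 - mpdiou   # loss = 1 - MPD-IoU
```

A perfectly matching prediction yields zero loss, while any corner displacement is penalized even when the IoU alone could not distinguish boxes sharing an aspect ratio.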
4.4. Crack Skeleton Extraction and Measurement
The extraction of crack skeletons serves as a crucial step in deriving geometric properties from segmented crack images, encompassing measurements such as length, width, and area. In this investigation, we employ the medial axis transformation (MAT) method to delineate the outline of the target crack based on binary images [
56]. This process involves selecting a point $p$ within the interior region ($A$) of any crack and identifying the point ($q$) on the crack boundary ($B$) that is closest to $p$. If multiple such $q$ points are found, $p$ is designated as a skeleton point of the crack. The search for $q$ is formalized in Equation (9):

$$d_{s}(p, B) = \inf \{\, d(p, z) \mid z \in B \,\} \quad (9)$$

where $\inf$ denotes the lower bound (infimum), $d$ and $d_{s}$ represent Euclidean distances, and $z$ represents any point on $B$. Points with multiple neighbors are excluded from consideration. Subsequently, a distance transformation is executed from all foreground pixels to the background, with "1" representing the foreground and "0" denoting the background. This process is illustrated in
Figure 8. Finally, the target is reduced to a skeleton based on the outcomes of the distance transformation. Essentially, this process mirrors a boundary erosion process.
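The distance transformation step can be sketched as follows in Python. This is a deliberately naive O(n²) illustration (a library routine such as SciPy's `distance_transform_edt` would be used in practice), showing how foreground pixels receive their distance to the nearest background pixel and how the medial ridge, whose pixels the skeleton retains, emerges.

```python
import math

def euclidean_distance_transform(grid):
    """Naive Euclidean distance transform on a binary grid.

    grid[r][c] == 1 marks foreground (crack) pixels, 0 marks background.
    Each foreground pixel receives its distance to the nearest background
    pixel; background pixels receive 0.
    """
    rows, cols = len(grid), len(grid[0])
    background = [(r, c) for r in range(rows) for c in range(cols)
                  if grid[r][c] == 0]
    dist = [[0.0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            if grid[r][c] == 1:
                dist[r][c] = min(math.hypot(r - br, c - bc)
                                 for br, bc in background)
    return dist

# A 3-pixel-wide horizontal "crack": the middle row is farthest from the
# background, so the distances peak there -- those ridge pixels are the
# skeleton points retained by the medial axis transform.
crack = [[0] * 7,
         [0, 1, 1, 1, 1, 1, 0],
         [0, 1, 1, 1, 1, 1, 0],
         [0, 1, 1, 1, 1, 1, 0],
         [0] * 7]
dt = euclidean_distance_transform(crack)
```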
To describe the MAT clearly, a diagram is used to explain how the algorithm calculates crack width and length. The coordinate system is shown in
Figure 9; each pixel in the crack image serves as one coordinate unit [57]. Therefore, a pixel coordinate system is established, with red and yellow line segments representing the boundaries of the crack. The black line segment in the middle is the crack skeleton, calculated with the central skeleton algorithm; $W(x)$ is the calculated average width of the crack, and each skeleton element has a finite length $dl$. Within the scope of this study, the width of each crack is represented by pixel values. Thus, each pixel represents one unit of width or length, making it straightforward to count the number of pixels contained in the crack. The formula for determining the crack length ($L$) is shown in Equation (10):

$$L = \int_{S} \psi(x) \, dl \quad (10)$$

The function $\psi(x)$ acts as a geometric calibration index, aligning the detected displacements of points within the image. Simultaneously, $dl$ denotes the finite length of each skeleton unit, and $S$ represents the path of the crack skeleton. The crack skeleton is a central line that traces the main path of the crack through the image; it is derived from the central skeleton algorithm, which identifies the midline of the crack. Integration (or, in the discrete case, summation) along the crack skeleton accumulates the lengths of the small segments $dl$ corrected by $\psi(x)$.
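Assuming an ordered list of skeleton pixel coordinates and a constant calibration factor (both simplifying assumptions made for illustration), the discrete form of the length calculation can be sketched in Python as:

```python
import math

def crack_length(skeleton_path, psi=1.0):
    """Discrete crack length: sum of psi * dl over the skeleton path.

    skeleton_path: ordered list of (row, col) pixel coordinates along the
                   crack skeleton.
    psi:           geometric calibration factor, assumed constant here;
                   with psi == 1 the result is in pixel units.
    """
    return sum(psi * math.hypot(r2 - r1, c2 - c1)
               for (r1, c1), (r2, c2) in zip(skeleton_path,
                                             skeleton_path[1:]))

# A straight horizontal skeleton of 11 pixels spans 10 unit steps, while
# a diagonal skeleton accumulates sqrt(2) per step.
horizontal = [(5, c) for c in range(11)]
diagonal = [(i, i) for i in range(6)]
```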
The pixel-level crack width value is computed from the pixel distance between a crack pixel (1) and an intact pixel (0) using the Euclidean distance transform (EDT) algorithm. The Euclidean distance ($d$) of two pixels, $p_{t}$ and $q_{t}$, is expressed by Equation (11) [58]:

$$d(p_{t}, q_{t}) = \sqrt{(p_{t,x} - q_{t,x})^{2} + (p_{t,y} - q_{t,y})^{2}} \quad (11)$$

where $t$ is an indexing variable used to uniquely identify each pixel within the respective sets $P$ and $Q$ when calculating the Euclidean distance between a cracked pixel and an intact pixel. $P$ is the set of green pixels represented by 1 in
Figure 8, and $Q$ is the set of white pixels, the complement of the green pixel set $P$.
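The pixel-distance computation of Equation (11), together with a simple width estimate at a skeleton pixel, can be sketched in Python as follows. Treating the width as twice the distance from the medial axis to the nearest intact pixel is a simplifying assumption made for illustration; the function names are hypothetical.

```python
import math

def min_distance_to_set(p, q_set):
    """Equation (11): Euclidean distance from a crack pixel p to the
    nearest intact (background) pixel in q_set."""
    return min(math.hypot(p[0] - qx, p[1] - qy) for qx, qy in q_set)

def crack_width_at(skeleton_pixel, intact_pixels):
    """Pixel-level width estimate at a skeleton pixel: the skeleton lies
    on the medial axis, so the full width is roughly twice the distance
    to the nearest background pixel."""
    return 2.0 * min_distance_to_set(skeleton_pixel, intact_pixels)

# A vertical crack column at c == 3, flanked by background columns at
# c == 0 and c == 6: the skeleton pixel (2, 3) sits 3 px from either side.
intact = [(r, 0) for r in range(5)] + [(r, 6) for r in range(5)]
```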
4.5. Performance Evaluation
The selection of suitable evaluation metrics is paramount when assessing segmentation models. While accuracy remains a prominent benchmark for existing instance segmentation models, the advent of real-time and lightweight models has introduced new considerations for practical deployment on equipment. Consequently, the proposed model is evaluated with a multifaceted approach, scrutinizing not only its accuracy but also its runtime performance and model complexity to ascertain its suitability for practical applications.
4.5.1. Accuracy
In this study, we conducted a comprehensive evaluation of the enhanced YOLOv8s-Seg model, employing a range of performance metrics, including precision, recall,
F1-score, and segment mAP at both 50% and 75% intersection over union (IoU) thresholds [
59,
60,
61]. Our assessment focused on two distinct aspects: the accuracy of crack detection, which was quantified using precision, recall, and
F1-score, and the quality of the segmentation results, which were appraised through segment mAP.
These metrics are computed as follows:

$$\text{Precision} = \frac{TP}{TP + FP}, \qquad \text{Recall} = \frac{TP}{TP + FN}$$

$$F1 = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}, \qquad mAP = \frac{1}{C} \sum_{i=1}^{C} AP(i)$$

Here, TP represents true positive samples, which are actual positive instances correctly predicted as such; FP indicates false positive samples, signifying actual negative instances erroneously identified as positive; and FN stands for false negatives, representing actual positive instances wrongly classified as negative. AP(i) corresponds to the average precision for each segmentation category, and C represents the total number of segmentation categories. Notably, the segmentation model's performance is directly linked to the cumulative AP score, and a higher AP score indicates improved segmentation quality.
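For reference, the metrics above reduce to a few lines of Python; the function names are illustrative, not part of the evaluation code used in this study.

```python
def precision(tp, fp):
    """Fraction of predicted positives that are correct."""
    return tp / (tp + fp) if tp + fp else 0.0

def recall(tp, fn):
    """Fraction of actual positives that are recovered."""
    return tp / (tp + fn) if tp + fn else 0.0

def f1_score(tp, fp, fn):
    """Harmonic mean of precision and recall."""
    p, r = precision(tp, fp), recall(tp, fn)
    return 2 * p * r / (p + r) if p + r else 0.0

def mean_ap(ap_per_class):
    """mAP = (1 / C) * sum of AP(i) over the C segmentation categories."""
    return sum(ap_per_class) / len(ap_per_class)
```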
4.5.2. Computational Time
In the realm of model evaluation, two critical indicators to gauge efficiency are training times and inference times. Training time denotes the duration required for a model to converge during the training process, while inference time signifies the time taken to process an image, thereby reflecting the model’s suitability for practical industrial applications. To streamline the assessment process, the use of FPS (frames per second) has become a prevalent metric for evaluating the efficiency of instance segmentation methods. This metric offers a straightforward means for researchers to identify and select faster inference models when operating under comparable conditions [
62,
63].
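As a sketch of how FPS might be measured for any inference callable, the following Python snippet times repeated calls and divides the frame count by the elapsed time; the warm-up convention and function name are assumptions for illustration, not a procedure prescribed by this study.

```python
import time

def measure_fps(infer, frames, warmup=2):
    """Average frames-per-second of an inference callable over a batch.

    infer:  callable taking one frame (here a stand-in for a
            segmentation model's forward pass).
    frames: iterable of inputs; a few warm-up runs are executed first so
            that cache or JIT effects do not skew the timing.
    """
    frames = list(frames)
    for f in frames[:warmup]:
        infer(f)
    start = time.perf_counter()
    for f in frames:
        infer(f)
    elapsed = time.perf_counter() - start
    return len(frames) / elapsed

# Dummy workload standing in for a model's forward pass.
fps = measure_fps(lambda f: sum(f), [[1] * 1000] * 50)
```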
4.5.3. Model Complexity
Model complexity plays a crucial role in evaluating the practical usability of a model, as it directly influences storage requirements and computational demands. In our study, it specifically refers to the storage needs linked to model parameters, a factor closely tied to the implementation of algorithms on mobile devices.
6. Conclusions
This study introduced a novel computer vision model aimed at effectively segmenting bridge cracks and deriving precise geometric information from the segmented crack images at the pixel level. The traditional method of human visual inspection is not only time-consuming and subjective but also poses risks such as falls and injuries. Although existing object detection models have mitigated these issues, they primarily rely on qualitative analyses for crack evaluation. In contrast, our proposed model offers a groundbreaking quantitative analysis, empowering bridge inspectors to make informed decisions. The superiority of the developed YOLOv8s + AFPN + MPDIoU model lies in the following:
Balanced accuracy and speed: By employing a one-stage instance model, we strike a crucial balance between segmentation accuracy and processing speed. This approach avoids the limitations of two-stage methods that may not comprehensively capture the interplay between detection and segmentation. Multi-stage methods, often associated with extended processing times, are circumvented, enabling real-time segmentation.
Innovative feature fusion: The incorporation of an asymptotic feature pyramid network in the YOLOv8-seg model replaces conventional top-down or bottom-up feature pyramid networks. This innovative choice addresses the loss and deterioration of feature information between non-adjacent layers, facilitating improved feature fusion and enhancing the overall model performance.
Specialized loss function: The introduction of the minimum point distance-IoU loss function tackles issues arising when the predicted bounding box possesses the same aspect ratio as the ground-truth bounding box but exhibits distinct width and height values. This tailored loss function ensures a more accurate and reliable model.
Quantitative measurement method: The combination of the medial axis transformation (MAT) method and the Euclidean distance transform to calculate the length and width of bridge cracks in segmented images provides a quantitative basis for maintenance suggestions. This method enhances the precision of crack assessment.
Furthermore, in terms of the loss function selection, the MPD-IoU loss function emerges as the optimal choice, achieving the highest recall score. Given that every crack poses a potential threat to bridge health, this selection is crucial. Comparative analysis with other instance models reveals that the YOLOv8s + AFPN + MPDIoU model outperforms others in terms of model size, inference time, and mAP50, making it the optimal choice for practical industry applications. Additionally, the evaluation of max width and length errors between manual inspection and segmented crack results indicates a minimal discrepancy of around 5%, significantly smaller than the errors associated with visual inspection. This underscores the reliability and precision of our proposed model in assessing bridge cracks. In summary, our computer vision model stands out as an innovative, efficient, and reliable solution for bridge inspection and maintenance in the industry.