2.1. Overview of the Input Data
This research focuses on improving the segmentation of defects in bridge structures, though the proposed approach can also be applied to other buildings and structures made of concrete and reinforced concrete. Concrete and steel are the primary materials used for bridge construction, with wood being used less frequently [19]. Globally, bridges are predominantly constructed from reinforced and prestressed concrete [5,20]. Approximately 67% of all bridges in Finland are either concrete or contain concrete elements [21]. In the United States, 254,965 out of 611,833 bridges are made of concrete [22]. Given this prevalence, this research focused on detecting concrete defects, particularly cracks.
The images used for training and validation were sourced from open datasets and supplemented with photographs taken by the authors on bridge structures. The images were resized and cropped to 1280 × 1280 to prevent distortion and blurring of fine cracks when fed into segmentation models. Annotation was carried out using tools such as Computer Vision Annotation Tool (CVAT) version 2.30.0 [23] and Label Studio version 1.15.0 [24]. These open-source tools support the annotation of images in various formats for classification, object detection, and segmentation tasks.
All images in the dataset contain cracks, as the primary objective of this study was crack segmentation rather than classification. Thus, the presence of crack-free images was not considered crucial for evaluating segmentation performance. However, the dataset includes images where cracks coexist with other structural defects, reflecting real-world conditions where multiple types of damage may be present simultaneously. This ensured the model’s robustness in distinguishing cracks even in complex scenarios.
After annotation, the segmentation masks were exported in PNG format. For You Only Look Once (YOLO11x-seg), these masks were converted using the Open Computer Vision Library (OpenCV) version 4.11.0.86 [25], which identified all contour points of the masks. The coordinates of these points were normalized by dividing them by the width and height of the image and then adding them to a list.
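A minimal sketch of this conversion is shown below; it assumes binary PNG masks in which crack pixels are non-zero, and the function name and file paths are illustrative rather than taken from the study's code.

```python
import cv2  # OpenCV, as referenced above

def mask_to_yolo_polygons(mask_path, class_id=0):
    """Convert a binary crack mask into YOLO-seg polygon label lines (sketch)."""
    mask = cv2.imread(mask_path, cv2.IMREAD_GRAYSCALE)
    h, w = mask.shape
    # Identify the contour points of the annotated crack regions
    contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    lines = []
    for contour in contours:
        if len(contour) < 3:  # skip degenerate contours
            continue
        coords = []
        for x, y in contour.reshape(-1, 2):
            # Normalize coordinates by the image width and height
            coords.extend([x / w, y / h])
        lines.append(f"{class_id} " + " ".join(f"{c:.6f}" for c in coords))
    return lines
```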
The dataset was divided into training and validation sets at a 70% to 30% ratio, with 329 images in the training set and 141 images in the validation set, totaling 470 images. The split was performed randomly to ensure that the training and validation sets contained a diverse representation of crack patterns and structural conditions. Random splitting helps prevent biases which can arise from manual selection and ensures that the model generalizes well across different test scenarios. These datasets were used across all machine learning models for segmentation. Examples of the training and validation sets, along with their annotations (masks), are shown in Figure 1.
For training the classification models, the training and validation datasets were divided into fragments of size 32 × 32, 64 × 64, 128 × 128, and 256 × 256. No overlap was applied during fragmentation.
The selection of fragments was performed using OpenCV, where the full image mask was analyzed to identify crack regions. Initially, fragments containing cracks, where at least 10% of the pixels belonged to a crack, were extracted for each scale. Then, an equal number of randomly selected fragments from the same image, containing no crack pixels, were added to balance the dataset.
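The fragment selection logic can be sketched as follows; the function below is illustrative (the names and the non-overlapping grid traversal are assumptions consistent with the description above), not the authors' exact implementation.

```python
import random
import numpy as np

def extract_balanced_fragments(image, mask, size, crack_ratio=0.10):
    """Extract fragments with >=10% crack pixels plus an equal number of
    crack-free fragments from the same image (a sampling sketch)."""
    h, w = mask.shape
    crack_frags, clean_candidates = [], []
    for y in range(0, h - size + 1, size):          # no overlap between fragments
        for x in range(0, w - size + 1, size):
            patch_mask = mask[y:y + size, x:x + size]
            frac = np.count_nonzero(patch_mask) / (size * size)
            patch = image[y:y + size, x:x + size]
            if frac >= crack_ratio:
                crack_frags.append(patch)
            elif frac == 0:
                clean_candidates.append(patch)
    # Balance the dataset: as many crack-free fragments as crack fragments
    clean_frags = random.sample(clean_candidates,
                                min(len(crack_frags), len(clean_candidates)))
    return crack_frags, clean_frags
```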
The final dataset sizes were as follows:
Training set: 11,266 fragments (5633 with cracks, 5633 without cracks);
Validation set: 9018 fragments (4509 with cracks, 4509 without cracks).
This approach ensured a balanced distribution of crack and non-crack samples while preventing bias and maintaining consistency across different fragment sizes.
2.2. Ensemble of Models for Classification and Generation of Crack Probability Maps
2.2.1. Preliminary Analysis of Classification Models
Binary classification is one of the fundamental tasks in computer vision, involving the determination of whether a given object is present in an image or not. In this study, lightweight CNN models were considered for classification:
MobileNetV2;
MobileNetV3-Small;
MobileNetV3-Large;
AlexNet;
ShuffleNet V2×0.5.
The selection of these models was motivated by the goal of minimizing their impact on the segmentation time.
The MobileNetV2 architecture, introduced in 2018 as an improvement over MobileNetV1, was designed for efficient image recognition on mobile and low-power devices [26].
The key features of MobileNetV2 are as follows:
Inverted Residuals: Instead of the traditional residual connections used in ResNet, the authors proposed an inverted residual structure. A classical bottleneck block first compresses the channels with a 1 × 1 convolution, applies a 3 × 3 convolution, and then expands the channels again with a 1 × 1 convolution. The inverted residual block reverses this order: the channels are first expanded (1 × 1 convolution), then a depthwise 3 × 3 convolution is applied, and finally the channels are compressed (1 × 1 convolution).
Linear Bottlenecks: MobileNetV2 introduces the concept of linear bottlenecks, where no ReLU activation is applied at the narrowest point (after channel compression). This approach prevents information loss which may occur due to the ReLU function.
Depthwise Separable Convolution: Similar to MobileNetV1, this model employs depthwise separable convolution. However, these operations are further optimized by combining depthwise convolutions with channel expansion.
Expansion Factor: Each block initially expands the channels sixfold before applying the depthwise convolution, allowing the model to retain more information (a minimal sketch of such a block is given after this list).
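The following PyTorch sketch illustrates an inverted residual block with a sixfold expansion factor and a linear bottleneck; it is provided for illustration and is not the exact torchvision implementation.

```python
import torch.nn as nn

class InvertedResidual(nn.Module):
    """Illustrative inverted residual block in the spirit of MobileNetV2."""
    def __init__(self, in_ch, out_ch, stride=1, expansion=6):
        super().__init__()
        hidden = in_ch * expansion
        self.use_residual = stride == 1 and in_ch == out_ch
        self.block = nn.Sequential(
            nn.Conv2d(in_ch, hidden, 1, bias=False),           # 1x1 expansion
            nn.BatchNorm2d(hidden), nn.ReLU6(inplace=True),
            nn.Conv2d(hidden, hidden, 3, stride, 1,
                      groups=hidden, bias=False),               # 3x3 depthwise
            nn.BatchNorm2d(hidden), nn.ReLU6(inplace=True),
            nn.Conv2d(hidden, out_ch, 1, bias=False),           # 1x1 projection
            nn.BatchNorm2d(out_ch),                              # no ReLU: linear bottleneck
        )

    def forward(self, x):
        return x + self.block(x) if self.use_residual else self.block(x)
```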
The MobileNetV3 model, introduced in 2019, builds upon MobileNetV2 by incorporating automated neural architecture search (NAS), Hard Swish activations, and Squeeze-and-Excitation blocks. MobileNetV3 is available in two versions—large and small—designed for different resource constraints [27].
Automated architecture search enabled the identification of optimal parameter combinations for balancing performance and energy efficiency. Unlike ReLU6, which is used in MobileNetV2, the new architecture employs Hard Swish (h-swish), a smoother non-linear activation function which helps prevent gradient vanishing. Although ReLU is still used in some layers, h-swish is predominantly applied. Another improvement is the inclusion of Squeeze-and-Excitation (SE) blocks, which dynamically adjust channel weights to enhance feature selection.
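For reference, Hard Swish can be expressed compactly in terms of ReLU6; the sketch below uses the standard formulation and is provided for illustration only.

```python
import torch
import torch.nn.functional as F

def hard_swish(x: torch.Tensor) -> torch.Tensor:
    # h-swish(x) = x * ReLU6(x + 3) / 6: a cheap, smooth approximation of Swish
    return x * F.relu6(x + 3.0) / 6.0
```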
The primary difference between the large and small versions lies in the number of 3 × 3 convolutional layers (both standard and depthwise) and the complexity of their operations. The large version includes more complex operations, whereas the small version replaces them with lighter ones to suit lower-resource environments.
Although the AlexNet model was introduced in 2012 [28], its architecture, consisting of five convolutional layers (some accompanied by max-pooling layers) and three fully connected layers, remains relatively simple yet effective for binary classification tasks. The network comprises approximately 60 million parameters and 650,000 neurons. ReLU is used as the activation function, which ensures nonlinear transformations and efficient gradient propagation. To prevent overfitting, the dropout technique is applied in the fully connected layers, randomly deactivating a portion of the neurons during training.
Despite its simplicity compared with more recent models, this architecture can serve as a robust baseline for binary classification, particularly when computational resources are limited or when working with moderately sized datasets. Its design highlights the balance between computational efficiency and classification accuracy, making it a valuable choice in specific scenarios.
ShuffleNet V2 is an enhanced version of the ShuffleNet architecture designed to improve the efficiency of convolutional neural networks on mobile devices. The architecture was introduced in 2018 [29].
ShuffleNet V1, introduced in 2017 [30], was developed as a lightweight model optimized for mobile devices by utilizing group convolutions and channel shuffling to improve computational efficiency. The key innovation of ShuffleNet V1 was its ability to reduce computational complexity while maintaining high accuracy by introducing pointwise group convolutions and channel shuffle operations, which helped facilitate information exchange across feature maps.
One of the primary changes in ShuffleNet V2 is the elimination of group convolutions, which were found to be memory-intensive and reduce parallelism, thereby decreasing computational speed. Another distinctive feature of this model is the use of the “channel split” operation. At the beginning of each block, the channels are divided into two branches; one branch remains unchanged (identity), while the other undergoes a series of convolutions. The results from both branches are then combined and shuffled, enabling efficient information exchange between the channels.
The model was introduced in four variants, with the smallest (0.5×) containing 1.4 million parameters and the largest (2×) containing 7.4 million parameters, allowing users to select an appropriate balance between performance and computational complexity.
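A minimal sketch of the channel split and channel shuffle operations described above is given below; `branch_conv` is a placeholder for the convolutional branch and is not part of the original architecture description.

```python
import torch

def channel_shuffle(x: torch.Tensor, groups: int = 2) -> torch.Tensor:
    """Interleave channels across groups so the two branches exchange information."""
    n, c, h, w = x.shape
    x = x.view(n, groups, c // groups, h, w)
    x = x.transpose(1, 2).contiguous()
    return x.view(n, c, h, w)

def shufflenet_v2_unit(x: torch.Tensor, branch_conv) -> torch.Tensor:
    """Sketch of a stride-1 ShuffleNet V2 unit: split, transform one branch, concatenate, shuffle."""
    identity, transform = x.chunk(2, dim=1)   # channel split into two halves
    out = torch.cat((identity, branch_conv(transform)), dim=1)
    return channel_shuffle(out, groups=2)
```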
2.2.2. Training and Comparison of Classification Models
The training and validation of the considered models were conducted using the dataset described in Section 2.1. All fragments, regardless of their initial size, were resized to a uniform dimension of 64 × 64. To prevent overfitting and underfitting, all models utilized an early stopping mechanism based on the validation loss, and model checkpoints were saved at both the lowest loss value and the highest F1 score.
The following hyperparameters were employed during training (a minimal configuration sketch follows the list):
Optimizer: AdamW was chosen due to its improved stability and efficiency in updating weights compared with standard Adam. The inclusion of weight decay in AdamW helps mitigate overfitting.
Learning Rate: We selected 0.0001 empirically after preliminary experiments. Higher values (e.g., 0.001) caused training instability, while lower values (e.g., 0.00001) significantly slowed convergence.
Weight Decay: We used 0.01 for model regularization to prevent overfitting; this value proved optimal based on validation set performance.
Loss Function: BCELoss was selected due to its suitability for binary classification tasks, such as determining the presence of a crack in an image fragment.
Batch Size: We experimentally determined the batch size to be 64. Both smaller and larger values resulted in poorer validation performance, making 64 the optimal choice for stable training and efficient convergence.
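A minimal PyTorch training-setup sketch reflecting these hyperparameters is shown below. The choice of MobileNetV2 as the example model, the single-logit head, and the loop structure are illustrative assumptions; the early-stopping and best-checkpoint logic described above is omitted for brevity.

```python
import torch
import torch.nn as nn
from torchvision import models

# Illustrative binary-classification setup with the listed hyperparameters
model = models.mobilenet_v2(weights=None)
model.classifier[1] = nn.Linear(model.last_channel, 1)  # single-logit binary head

criterion = nn.BCELoss()  # binary cross-entropy, as in the study
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=0.01)

def training_step(images, labels):
    """One optimization step on a batch of 64 fragments resized to 64 x 64."""
    optimizer.zero_grad()
    probs = torch.sigmoid(model(images)).squeeze(1)  # BCELoss expects probabilities in [0, 1]
    loss = criterion(probs, labels.float())
    loss.backward()
    optimizer.step()
    return loss.item()
```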
After training, the models were compared using the loss and F1 score metrics. This analysis covered all five models (three of which belong to the MobileNet family), each following the same training and validation scheme, allowing their performance to be evaluated under identical conditions.
2.2.3. Formation of an Ensemble Model for Generating Crack Probability Maps
Based on the training results, the MobileNetV2 model achieved the best validation metrics, followed by MobileNetV3 and ShuffleNetV2 × 0.5. While ShuffleNet demonstrated better loss performance than the small version of MobileNetV3, it performed worse than the large version. Conversely, its F1 score was slightly better than that of the large version and slightly worse than that of the small version. Consequently, MobileNetV2 was chosen as the primary model for the ensemble, complemented by ShuffleNetV2 × 0.5 for small-sized patches, since its performance fell between the two MobileNetV3 versions. For patches of sizes 64 × 64 and 128 × 128, the MobileNetV2 checkpoint saved at the highest F1 score was used, as the primary task at these scales was to reduce false negatives and false positives.
Fragments of different sizes were considered, ranging from 8 × 8 to 256 × 256. However, fragments smaller than 32 × 32 contained too little information, leading to poor classification results. Additionally, using fragments larger than 128 × 128 reduced the accuracy of the probability maps. The use of 256 × 256 fragments did not improve the results and instead introduced excessive smoothing, reducing segmentation quality. The final choice of fragment sizes (32 × 32, 64 × 64, and 128 × 128) was made based on extensive testing to balance detail preservation and computational efficiency.
Table 1 presents the validation metrics for all considered models. The models were saved based on both the lowest loss and highest F1 score to ensure a fair comparison.
All models were tested with all fragment sizes simultaneously. The metrics presented in Table 1 are the validation scores for all fragments combined. While no separate evaluation was conducted per fragment size, small-scale testing on the selected datasets confirmed that the validation results aligned with the probability map quality. However, individual models performed significantly worse than the ensemble approach. Two models (MobileNetV2 and ShuffleNetV2 × 0.5) were selected for the 32 × 32 fragments because models at this scale often struggle with false detections due to concrete texture variability. Using two models helped mitigate this issue by leveraging the strengths of each.
Various weighting schemes were tested for combining model outputs. However, using unequal weight coefficients led to unstable results, where a model performed well in certain conditions but worse in others. Consequently, equal weighting was chosen to ensure consistent probability map quality in all cases. Similarly, different scaling factors were explored, but they either introduced excessive noise or reduced the probability map accuracy. Ultimately, the selected approach involved multiplying the probability maps by 0.25 and summing the resulting maps to achieve a stable and reliable fusion.
The overall ensemble structure is as follows (a fusion sketch follows the list):
MobileNetV2 (loss), 32 × 32;
ShuffleNetV2×0.5 (loss), 32 × 32;
MobileNetV2 (F1), 64 × 64;
MobileNetV2 (F1), 128 × 128.
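A minimal sketch of the equal-weight fusion described above is shown below; the variable names for the four per-model probability maps are hypothetical.

```python
import numpy as np

def fuse_probability_maps(maps):
    """Equal-weight fusion of the four per-model crack probability maps
    (each map is a 2-D array of per-pixel probabilities in [0, 1])."""
    fused = np.zeros_like(maps[0], dtype=np.float32)
    for prob_map in maps:
        fused += 0.25 * prob_map.astype(np.float32)  # multiply by 0.25 and sum
    return fused  # the result stays in [0, 1]

# Usage with hypothetical map names:
# combined = fuse_probability_maps([mnv2_32, shufflenet_32, mnv2_f1_64, mnv2_f1_128])
```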
Figure 2 visualizes the probability maps produced by each model in the ensemble, together with the combined crack probability map.
As shown, the use of two models for classifying 32 × 32 fragments and two models for larger fragments effectively suppressed false positive detections while preserving the detail of the probability map.
This ensemble selection strategy ensured a balance between classification accuracy and computational efficiency, reducing false positives for small fragments while maintaining high-quality probability maps for crack detection.
2.2.4. Dataset Generation with a Fourth Channel
The input images in the dataset were divided into fragments of sizes 32 × 32, 64 × 64, and 128 × 128, with a step size equal to half the fragment size. These fragments were then classified using an ensemble of models, and the classification results were stored in the alpha channel of the PNG file. High confidence for the “crack” class was represented by maximum opacity, while low confidence corresponded to full transparency. Importantly, the information in the RGB channels was preserved, enabling segmentation models to adapt to the fourth channel rather than relying solely on the remaining RGB data post-classification.
No additional post-processing was applied.
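A minimal sketch of attaching the fused probability map (obtained from the overlapping fragments classified with a half-size step) as a fourth alpha channel with OpenCV is given below; the 0–255 scaling of confidence values is an assumption consistent with the opacity description above.

```python
import cv2
import numpy as np

def add_probability_alpha(image_bgr, prob_map, out_path):
    """Store the fused crack probability map in the alpha channel of a PNG,
    keeping the original RGB data intact (a sketch)."""
    alpha = (np.clip(prob_map, 0.0, 1.0) * 255).astype(np.uint8)  # 255 = high confidence, 0 = transparent
    bgra = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2BGRA)
    bgra[:, :, 3] = alpha
    cv2.imwrite(out_path, bgra)  # PNG preserves the fourth (alpha) channel
```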
2.3. Image Segmentation Models
To evaluate the effectiveness of the proposed approach, this study considered lightweight models for real-time segmentation (ENet), well-established architectures such as U-Net and SegNet, as well as relatively recent models, including DeepLabV3+, FastFCN, and HRNet. Additionally, a comparison was conducted with the state-of-the-art real-time segmentation model YOLO11x-seg.
The DeepLabV3+ model was introduced in 2018 as a modification of DeepLabV3 [31] for semantic segmentation. The primary improvement was the addition of a simple yet effective decoder to refine the segmentation results, particularly along object boundaries.
In addition, the model enhanced the Atrous Spatial Pyramid Pooling (ASPP) module, which captures multi-scale contextual information by applying atrous convolution with varying rates.
The architecture of the model used in this study is schematically presented in Figure A1.
Figure A1 shows the DeepLabV3+ architecture, illustrating the role of the encoder, ASPP module, and decoder in producing refined segmentation masks. The encoder is based on a ResNet50 [32] backbone, while the ASPP module captures features at different scales, and the decoder improves the spatial resolution for object boundary delineation.
The ENet model was introduced in 2016 for real-time semantic segmentation tasks [33]. The model operates up to 18 times faster and has 79 times fewer parameters than the state-of-the-art models of its time, while achieving comparable accuracy. A key feature of the model is its use of a large encoder and a small decoder. According to the authors, the encoder should mimic the behavior of original architectures by processing low-resolution data, while the decoder’s role is to upsample the encoder’s output and refine the details.
Another notable feature is the removal of most ReLU activations in the initial layers. Additionally, the authors proposed replacing ReLU with PReLU, which allows negative activations to have nonzero values.
The architecture of the ENet model is schematically shown in Figure A2.
Figure A2 illustrates the ENet architecture, highlighting the structure of its encoder and decoder. The encoder processes input images at reduced resolutions to efficiently extract features, while the lightweight decoder progressively restores spatial details in the segmentation output.
The FastFCN architecture was introduced in 2019 [34]. Its main innovation lies in replacing the dilated convolutions in the backbone with standard convolutions, thereby restoring the original backbone architecture (e.g., ResNet) without dilation. Instead of dilated convolutions, the authors proposed a novel approach called Joint Pyramid Upsampling (JPU).
The JPU architecture employs a feature pyramid approach, gathering features from multiple layers (e.g., layers 3, 4, and 5 in ResNet). To increase the resolution of these features, upsampling is performed using three parallel 3 × 3 convolutions with different stride levels. Multi-scale features are then combined using channel-wise concatenation. According to the original publication, this method enables the extraction of global features with lower computational costs compared with traditional methods.
Additionally, this model incorporates the Context Encoding Module (CEM) [35] to capture global context and enhance semantic understanding. This reduces dependency on local noise and improves model stability.
The architecture used in this study is schematically shown in Figure A3.
Figure A3 illustrates the FastFCN architecture, highlighting the integration of JPU for multi-scale feature extraction and CEM for global context representation. The combination of these modules enhances segmentation accuracy while maintaining computational efficiency.
The High-Resolution Network (HRNet) was introduced in 2019 [36]. The architecture is designed to maintain high-resolution representations throughout the entire processing pipeline.
The key features of HRNet include the following:
Parallel branches with varying resolutions: Unlike traditional architectures which progressively reduce the resolution, HRNet maintains multiple parallel branches with different resolutions.
Multi-level information fusion: A continuous exchange of information between branches enables the generation of more precise representations.
High accuracy and spatial precision: By preserving the resolution, HRNet ensures more accurate representations in tasks requiring precise spatial information.
The architecture used in this study is schematically shown in Figure A4.
Figure A4 illustrates the HRNet architecture, emphasizing its multi-resolution branches and the fusion process, which enables accurate and detailed spatial representations for segmentation tasks.
The SegNet architecture was introduced in 2015 [37] as an efficient neural network for semantic segmentation, targeting mobile applications and real-time computation. The primary innovation of this model lies in its encoder-decoder structure with a unique upsampling mechanism.
The key features of SegNet include the following:
Encoder-decoder architecture: The encoder consists of 13 convolutional layers, analogous to the first 13 layers of the VGG16 network, and utilizes max pooling to progressively reduce the feature dimensions. The decoder, in turn, is responsible for restoring the resolution of the segmented image.
Index-based upsampling: The core innovation of SegNet is the use of stored max-pooling indices from the encoder during upsampling in the decoder (see the sketch after this list). This allows upsampling without the need to train additional parameters, significantly reducing memory requirements and improving the model speed.
Memory and computational efficiency: Thanks to its architecture, SegNet uses fewer parameters compared with other deep segmentation networks (e.g., FCN or DeepLab), making it suitable for deployment on devices with limited computational resources.
Segmentation accuracy: Restoring the resolution via pooling indices helps preserve object details, which is crucial for semantic segmentation tasks that require pixel-level precision.
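The index-based upsampling mechanism can be illustrated with PyTorch's max-pooling and unpooling operations; the tensor sizes below are arbitrary example values.

```python
import torch
import torch.nn.functional as F

# Sketch of SegNet-style index-based upsampling: the encoder stores the
# max-pooling indices, and the decoder reuses them to place values back
# at their original spatial positions.
x = torch.randn(1, 64, 32, 32)                                   # example encoder feature map
pooled, indices = F.max_pool2d(x, kernel_size=2, stride=2, return_indices=True)
upsampled = F.max_unpool2d(pooled, indices, kernel_size=2, stride=2)
print(upsampled.shape)  # torch.Size([1, 64, 32, 32])
```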
The architecture used in this study is schematically shown in Figure A5.
Figure A5 illustrates the SegNet architecture, highlighting its encoder-decoder structure and index-based upsampling mechanism, which contribute to its efficiency and accuracy in segmentation tasks.
The U-Net model was introduced in 2015 [38] as an architecture for semantic segmentation specifically designed for processing medical images. Due to its efficiency, it has become popular for a wide range of segmentation tasks, including satellite imagery, agronomy, biological analysis, and material defect detection.
The key features of U-Net include the following:
Symmetrical encoder-decoder architecture: The encoder consists of convolutional layers, each followed by max pooling to reduce the image dimensions and extract key features. The decoder gradually increases the resolution using transposed convolutions to restore the pixel representation of segmented objects.
Skip connections: A defining feature of U-Net is the presence of skip connections between the encoder and decoder. These connections allow high-resolution information from early encoder layers to be directly passed to corresponding decoder layers, significantly improving the accuracy of detail recovery.
Preservation of object boundaries and precise localization: This strategy helps retain object contours and enables precise localization, making the model highly effective for tasks which require recognition of fine details.
High accuracy on small datasets: Due to its multi-level processing and efficient use of spatial information, U-Net delivers excellent results even on small datasets.
The architecture used in this study is schematically shown in Figure A6.
Figure A6 illustrates the U-Net architecture, emphasizing its symmetrical encoder-decoder structure and the use of skip connections to enhance segmentation accuracy and detail preservation.
The YOLOv11 model is the latest iteration in the “You Only Look Once” (YOLO) series developed by Ultralytics. It continues the tradition of providing high speed and accuracy in computer vision tasks. In the updated architecture, the C2f block from previous versions has been replaced with a C3k2 block. Additionally, the model retains the SPPF component and introduces the C2PSA module, which enhances feature extraction [39].
Other improvements in YOLOv11 [40] include the following:
Enhanced feature extraction;
Optimized speed and efficiency;
Improved accuracy with fewer parameters.
In this study, the largest version of the model, YOLO11x-seg, was utilized.
The architecture used in this study is schematically shown in Figure A7.
Figure A7 illustrates the YOLOv11 architecture, emphasizing its advanced components, including the C3k2 block and the C2PSA module, which collectively enhance its feature extraction and segmentation capabilities while maintaining efficiency.
2.5. Evaluation Metrics
All models were evaluated using consistent metrics, with additional comparisons made against the original YOLO11x-seg model. The machine learning models in this study were assessed based on the following metrics: precision, recall, F1 score, IoU, and loss. Additionally, the epoch at which the best-performing model was saved was recorded.
Precision in segmentation tasks measures the proportion of correctly predicted pixels belonging to a class out of the total number of pixels predicted for that class:
$$\text{Precision} = \frac{TP}{TP + FP},$$
where TP (true positive) is the number of pixels correctly classified as belonging to the target class and FP (false positive) is the number of pixels incorrectly classified as belonging to the target class.
Recall measures the proportion of actual pixels belonging to a class which were correctly identified by the model. This metric reflects the model’s sensitivity to detecting pixels of a given class:
$$\text{Recall} = \frac{TP}{TP + FN},$$
where FN (false negative) is the number of pixels belonging to the target class which were missed, i.e., incorrectly classified as not belonging to the class.
The F1 score is the harmonic mean of precision and recall, accounting for both false positives and false negatives:
$$F1 = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}.$$
Here, the intersection over union (IoU), or the Jaccard Index, measures the degree of overlap between the predicted mask and the ground truth mask:
$$\text{IoU} = \frac{\text{Intersection}}{\text{Union}},$$
where Intersection is the number of pixels belonging to both the predicted mask and the ground truth mask and Union is the total number of pixels belonging to at least one of the masks (either predicted or ground truth).
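These metrics can be computed directly from binary masks, as in the sketch below; it assumes NumPy boolean arrays of equal shape and is provided for illustration only.

```python
import numpy as np

def segmentation_metrics(pred_mask, gt_mask):
    """Pixel-level precision, recall, F1 score, and IoU for binary masks (sketch)."""
    pred, gt = pred_mask.astype(bool), gt_mask.astype(bool)
    tp = np.logical_and(pred, gt).sum()      # correctly predicted crack pixels
    fp = np.logical_and(pred, ~gt).sum()     # predicted crack pixels that are background
    fn = np.logical_and(~pred, gt).sum()     # crack pixels missed by the model
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    iou = tp / (tp + fp + fn) if tp + fp + fn else 0.0  # |intersection| / |union|
    return precision, recall, f1, iou
```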