2.1. Overview of the Input Data
This research focuses on improving the segmentation of defects in bridge structures, though the proposed approach can also be applied to other buildings and structures made of concrete and reinforced concrete. Concrete and steel are the primary materials used for bridge construction, with wood being used less frequently [19]. Globally, bridges are predominantly constructed from reinforced and prestressed concrete [5,20]. Approximately 67% of all bridges in Finland are either concrete or contain concrete elements [21]. In the United States, 254,965 out of 611,833 bridges are made of concrete [22]. Given this prevalence, this research focused on detecting concrete defects, particularly cracks.
The images used for training and validation were sourced from open datasets and supplemented with photographs taken by the authors on bridge structures. The images were resized and cropped to 1280 × 1280 to prevent distortion and blurring of fine cracks when fed into segmentation models. Annotation was carried out using tools such as Computer Vision Annotation Tool (CVAT) version 2.30.0 [23] and Label Studio version 1.15.0 [24]. These open-source tools support the annotation of images in various formats for classification, object detection, and segmentation tasks.
All images in the dataset contain cracks, as the primary objective of this study was crack segmentation rather than classification. Thus, the presence of crack-free images was not considered crucial for evaluating segmentation performance. However, the dataset includes images where cracks coexist with other structural defects, reflecting real-world conditions where multiple types of damage may be present simultaneously. This ensured the model’s robustness in distinguishing cracks even in complex scenarios.
After annotation, the segmentation masks were exported in PNG format. For You Only Look Once (YOLO11x-seg), these masks were converted using the Open Computer Vision Library (OpenCV) version 4.11.0.86 [25], which identified all contour points of the masks. The coordinates of these points were normalized by dividing them by the width and height of the image and then adding them to a list.
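A minimal sketch of this conversion is shown below; it assumes binary PNG masks in which crack pixels are non-zero, and the function name and file paths are illustrative rather than taken from the study's code.

```python
import cv2  # OpenCV, as referenced above

def mask_to_yolo_polygons(mask_path, class_id=0):
    """Convert a binary crack mask into YOLO-seg polygon label lines (sketch)."""
    mask = cv2.imread(mask_path, cv2.IMREAD_GRAYSCALE)
    h, w = mask.shape
    # Identify the contour points of the annotated crack regions
    contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    lines = []
    for contour in contours:
        if len(contour) < 3:  # skip degenerate contours
            continue
        coords = []
        for x, y in contour.reshape(-1, 2):
            # Normalize coordinates by the image width and height
            coords.extend([x / w, y / h])
        lines.append(f"{class_id} " + " ".join(f"{c:.6f}" for c in coords))
    return lines
```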
The dataset was divided into training and validation sets at a 70% to 30% ratio, with 329 images in the training set and 141 images in the validation set, totaling 470 images. The split was performed randomly to ensure that the training and validation sets contained a diverse representation of crack patterns and structural conditions. Random splitting helps prevent biases which can arise from manual selection and ensures that the model generalizes well across different test scenarios. These datasets were used across all machine learning models for segmentation. Examples of the training and validation sets, along with their annotations (masks), are shown in Figure 1.
For training the classification models, the training and validation datasets were divided into fragments of size 32 × 32, 64 × 64, 128 × 128, and 256 × 256. No overlap was applied during fragmentation.
The selection of fragments was performed using OpenCV, where the full image mask was analyzed to identify crack regions. Initially, fragments containing cracks, where at least 10% of the pixels belonged to a crack, were extracted for each scale. Then, an equal number of randomly selected fragments from the same image, containing no crack pixels, were added to balance the dataset.
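The fragment selection logic can be sketched as follows; the function below is illustrative (the names and the non-overlapping grid traversal are assumptions consistent with the description above), not the authors' exact implementation.

```python
import random
import numpy as np

def extract_balanced_fragments(image, mask, size, crack_ratio=0.10):
    """Extract fragments with >=10% crack pixels plus an equal number of
    crack-free fragments from the same image (a sampling sketch)."""
    h, w = mask.shape
    crack_frags, clean_candidates = [], []
    for y in range(0, h - size + 1, size):          # no overlap between fragments
        for x in range(0, w - size + 1, size):
            patch_mask = mask[y:y + size, x:x + size]
            frac = np.count_nonzero(patch_mask) / (size * size)
            patch = image[y:y + size, x:x + size]
            if frac >= crack_ratio:
                crack_frags.append(patch)
            elif frac == 0:
                clean_candidates.append(patch)
    # Balance the dataset: as many crack-free fragments as crack fragments
    clean_frags = random.sample(clean_candidates,
                                min(len(crack_frags), len(clean_candidates)))
    return crack_frags, clean_frags
```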
The final dataset sizes were as follows:
Training set: 11,266 fragments (5633 with cracks, 5633 without cracks);
Validation set: 9018 fragments (4509 with cracks, 4509 without cracks).
This approach ensured a balanced distribution of crack and non-crack samples while preventing bias and maintaining consistency across different fragment sizes.
2.2. Ensemble of Models for Classification and Generation of Crack Probability Maps
2.2.1. Preliminary Analysis of Classification Models
Binary classification is one of the fundamental tasks in computer vision, involving the determination of whether a given object is present in an image or not. In this study, lightweight CNN models were considered for classification:
MobileNetV2;
MobileNetV3-Small;
MobileNetV3-Large;
AlexNet;
ShuffleNet V2×0.5.
The selection of these models was motivated by the goal of minimizing their impact on the segmentation time.
The MobileNetV2 architecture, introduced in 2018 as an improvement over MobileNetV1, was designed for efficient image recognition on mobile and low-power devices [26].
The key features of MobileNetV2 are as follows:
Inverted Residuals: Instead of the traditional residual connections used in ResNet, the authors proposed an inverted residual structure. A classical bottleneck block first compresses the channels with a 1 × 1 convolution, applies a 3 × 3 convolution, and then expands the channels again with a 1 × 1 convolution. The inverted residual block reverses this order: the channels are first expanded (1 × 1 convolution), then a depthwise 3 × 3 convolution is applied, and finally the channels are compressed (1 × 1 convolution).
Linear Bottlenecks: MobileNetV2 introduces the concept of linear bottlenecks, where no ReLU activation is applied at the narrowest point (after channel compression). This approach prevents information loss which may occur due to the ReLU function.
Depthwise Separable Convolution: Similar to MobileNetV1, this model employs depthwise separable convolution. However, these operations are further optimized by combining depthwise convolutions with channel expansion.
Expansion Factor: Each block initially expands the channels sixfold before applying the depthwise convolution, allowing the model to retain more information (a minimal sketch of such a block is given after this list).
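The following PyTorch sketch illustrates an inverted residual block with a sixfold expansion factor and a linear bottleneck; it is provided for illustration and is not the exact torchvision implementation.

```python
import torch.nn as nn

class InvertedResidual(nn.Module):
    """Illustrative inverted residual block in the spirit of MobileNetV2."""
    def __init__(self, in_ch, out_ch, stride=1, expansion=6):
        super().__init__()
        hidden = in_ch * expansion
        self.use_residual = stride == 1 and in_ch == out_ch
        self.block = nn.Sequential(
            nn.Conv2d(in_ch, hidden, 1, bias=False),           # 1x1 expansion
            nn.BatchNorm2d(hidden), nn.ReLU6(inplace=True),
            nn.Conv2d(hidden, hidden, 3, stride, 1,
                      groups=hidden, bias=False),               # 3x3 depthwise
            nn.BatchNorm2d(hidden), nn.ReLU6(inplace=True),
            nn.Conv2d(hidden, out_ch, 1, bias=False),           # 1x1 projection
            nn.BatchNorm2d(out_ch),                              # no ReLU: linear bottleneck
        )

    def forward(self, x):
        return x + self.block(x) if self.use_residual else self.block(x)
```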
The MobileNetV3 model, introduced in 2019, builds upon MobileNetV2 by incorporating automated neural architecture search (NAS), Hard Swish activations, and Squeeze-and-Excitation blocks. MobileNetV3 is available in two versions—large and small—designed for different resource constraints [27].
Automated architecture search enabled the identification of optimal parameter combinations for balancing performance and energy efficiency. Unlike ReLU6, which is used in MobileNetV2, the new architecture employs Hard Swish (h-swish), a smoother non-linear activation function which helps prevent gradient vanishing. Although ReLU is still used in some layers, h-swish is predominantly applied. Another improvement is the inclusion of Squeeze-and-Excitation (SE) blocks, which dynamically adjust channel weights to enhance feature selection.
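For reference, Hard Swish can be expressed compactly in terms of ReLU6; the sketch below uses the standard formulation and is provided for illustration only.

```python
import torch
import torch.nn.functional as F

def hard_swish(x: torch.Tensor) -> torch.Tensor:
    # h-swish(x) = x * ReLU6(x + 3) / 6: a cheap, smooth approximation of Swish
    return x * F.relu6(x + 3.0) / 6.0
```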
The primary difference between the large and small versions lies in the number of 3 × 3 convolutional layers (both standard and depthwise) and the complexity of their operations. The large version includes more complex operations, whereas the small version replaces them with lighter ones to suit lower-resource environments.
Although the AlexNet model was introduced in 2012 [28], its architecture, consisting of five convolutional layers (some accompanied by max-pooling layers) and three fully connected layers, remains relatively simple yet effective for binary classification tasks. The network comprises approximately 60 million parameters and 650,000 neurons. ReLU is used as the activation function, which ensures nonlinear transformations and efficient gradient propagation. To prevent overfitting, the dropout technique is applied in the fully connected layers, randomly deactivating a portion of the neurons during training.
Despite its simplicity compared with more recent models, this architecture can serve as a robust baseline for binary classification, particularly when computational resources are limited or when working with moderately sized datasets. Its design highlights the balance between computational efficiency and classification accuracy, making it a valuable choice in specific scenarios.
ShuffleNet V2 is an enhanced version of the ShuffleNet architecture designed to improve the efficiency of convolutional neural networks on mobile devices. The architecture was introduced in 2018 [29].
ShuffleNet V1, introduced in 2017 [30], was developed as a lightweight model optimized for mobile devices by utilizing group convolutions and channel shuffling to improve computational efficiency. The key innovation of ShuffleNet V1 was its ability to reduce computational complexity while maintaining high accuracy by introducing pointwise group convolutions and channel shuffle operations, which helped facilitate information exchange across feature maps.
One of the primary changes in ShuffleNet V2 is the elimination of group convolutions, which were found to be memory-intensive and reduce parallelism, thereby decreasing computational speed. Another distinctive feature of this model is the use of the “channel split” operation. At the beginning of each block, the channels are divided into two branches; one branch remains unchanged (identity), while the other undergoes a series of convolutions. The results from both branches are then combined and shuffled, enabling efficient information exchange between the channels.
The model was introduced in four variants, with the smallest (0.5×) containing 1.4 million parameters and the largest (2×) containing 7.4 million parameters, allowing users to select an appropriate balance between performance and computational complexity.
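A minimal sketch of the channel split and channel shuffle operations described above is given below; `branch_conv` is a placeholder for the convolutional branch and is not part of the original architecture description.

```python
import torch

def channel_shuffle(x: torch.Tensor, groups: int = 2) -> torch.Tensor:
    """Interleave channels across groups so the two branches exchange information."""
    n, c, h, w = x.shape
    x = x.view(n, groups, c // groups, h, w)
    x = x.transpose(1, 2).contiguous()
    return x.view(n, c, h, w)

def shufflenet_v2_unit(x: torch.Tensor, branch_conv) -> torch.Tensor:
    """Sketch of a stride-1 ShuffleNet V2 unit: split, transform one branch, concatenate, shuffle."""
    identity, transform = x.chunk(2, dim=1)   # channel split into two halves
    out = torch.cat((identity, branch_conv(transform)), dim=1)
    return channel_shuffle(out, groups=2)
```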
2.2.2. Training and Comparison of Classification Models
The training and validation of the considered models were conducted using the dataset described in Section 2.1. All fragments, regardless of their initial size, were resized to a uniform dimension of 64 × 64. To prevent overfitting and underfitting, all models utilized an early stopping mechanism based on the validation loss, and model checkpoints were saved at both the lowest loss value and the highest F1 score.
The following hyperparameters were employed during training (a minimal configuration sketch follows the list):
Optimizer: AdamW was chosen due to its improved stability and efficiency in updating weights compared with standard Adam. The inclusion of weight decay in AdamW helps mitigate overfitting.
Learning Rate: We selected 0.0001 empirically after preliminary experiments. Higher values (e.g., 0.001) caused training instability, while lower values (e.g., 0.00001) significantly slowed convergence.
Weight Decay: We used 0.01 for model regularization to prevent overfitting; this value proved optimal based on validation set performance.
Loss Function: BCELoss was selected due to its suitability for binary classification tasks, such as determining the presence of a crack in an image fragment.
Batch Size: We experimentally determined the batch size to be 64. Both smaller and larger values resulted in poorer validation performance, making 64 the optimal choice for stable training and efficient convergence.
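A minimal PyTorch training-setup sketch reflecting these hyperparameters is shown below. The choice of MobileNetV2 as the example model, the single-logit head, and the loop structure are illustrative assumptions; the early-stopping and best-checkpoint logic described above is omitted for brevity.

```python
import torch
import torch.nn as nn
from torchvision import models

# Illustrative binary-classification setup with the listed hyperparameters
model = models.mobilenet_v2(weights=None)
model.classifier[1] = nn.Linear(model.last_channel, 1)  # single-logit binary head

criterion = nn.BCELoss()  # binary cross-entropy, as in the study
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=0.01)

def training_step(images, labels):
    """One optimization step on a batch of 64 fragments resized to 64 x 64."""
    optimizer.zero_grad()
    probs = torch.sigmoid(model(images)).squeeze(1)  # BCELoss expects probabilities in [0, 1]
    loss = criterion(probs, labels.float())
    loss.backward()
    optimizer.step()
    return loss.item()
```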
After training, the models were compared using the loss and F1 score metrics. This analysis covered all five models (three of which belong to the MobileNet family), each following the same training and validation scheme, allowing their performance to be evaluated under identical conditions.
2.2.3. Formation of an Ensemble Model for Generating Crack Probability Maps
Based on the training results, the MobileNetV2 model achieved the best validation metrics, followed by MobileNetV3 and ShuffleNetV2 × 0.5. While ShuffleNet demonstrated better loss performance than the small version of MobileNetV3, it performed worse than the large version. Conversely, its F1 score was slightly better than that of the large version and slightly worse than that of the small version. Consequently, MobileNetV2 was chosen as the primary model for the ensemble, complemented by ShuffleNetV2 × 0.5 for small-sized patches, since its performance fell between the two MobileNetV3 versions. For patches of sizes 64 × 64 and 128 × 128, the MobileNetV2 checkpoint saved at the highest F1 score was used, as the primary task at these scales was to reduce false negatives and false positives.
Fragments of different sizes were considered, ranging from 8 × 8 to 256 × 256. However, fragments smaller than 32 × 32 contained too little information, leading to poor classification results. Additionally, using fragments larger than 128 × 128 reduced the accuracy of the probability maps. The use of 256 × 256 fragments did not improve the results and instead introduced excessive smoothing, reducing segmentation quality. The final choice of fragment sizes (32 × 32, 64 × 64, and 128 × 128) was made based on extensive testing to balance detail preservation and computational efficiency.
Table 1 presents the validation metrics for all considered models. The models were saved based on both the lowest loss and highest F1 score to ensure a fair comparison.
All models were tested with all fragment sizes simultaneously. The metrics presented in Table 1 are the validation scores for all fragments combined. While no separate evaluation was conducted per fragment size, small-scale testing on the selected datasets confirmed that the validation results aligned with the probability map quality. However, individual models performed significantly worse than the ensemble approach. Two models (MobileNetV2 and ShuffleNetV2 × 0.5) were selected for the 32 × 32 fragments because models at this scale often struggle with false detections due to concrete texture variability. Using two models helped mitigate this issue by leveraging the strengths of each.
Various weighting schemes were tested for combining model outputs. However, using unequal weight coefficients led to unstable results, where a model performed well in certain conditions but worse in others. Consequently, equal weighting was chosen to ensure consistent probability map quality in all cases. Similarly, different scaling factors were explored, but they either introduced excessive noise or reduced the probability map accuracy. Ultimately, the selected approach involved multiplying the probability maps by 0.25 and summing the resulting maps to achieve a stable and reliable fusion.
The overall ensemble structure is as follows (a fusion sketch follows the list):
MobileNetV2 (loss), 32 × 32;
ShuffleNetV2×0.5 (loss), 32 × 32;
MobileNetV2 (F1), 64 × 64;
MobileNetV2 (F1), 128 × 128.
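A minimal sketch of the equal-weight fusion described above is shown below; the variable names for the four per-model probability maps are hypothetical.

```python
import numpy as np

def fuse_probability_maps(maps):
    """Equal-weight fusion of the four per-model crack probability maps
    (each map is a 2-D array of per-pixel probabilities in [0, 1])."""
    fused = np.zeros_like(maps[0], dtype=np.float32)
    for prob_map in maps:
        fused += 0.25 * prob_map.astype(np.float32)  # multiply by 0.25 and sum
    return fused  # the result stays in [0, 1]

# Usage with hypothetical map names:
# combined = fuse_probability_maps([mnv2_32, shufflenet_32, mnv2_f1_64, mnv2_f1_128])
```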
Figure 2 visualizes the probability maps produced by each model in the ensemble, together with the combined crack probability map.
As shown, the use of two models for classifying 32 × 32 fragments and two models for larger fragments effectively suppressed false positive detections while preserving the detail of the probability map.
This ensemble selection strategy ensured a balance between classification accuracy and computational efficiency, reducing false positives for small fragments while maintaining high-quality probability maps for crack detection.
2.2.4. Dataset Generation with a Fourth Channel
The input images in the dataset were divided into fragments of sizes 32 × 32, 64 × 64, and 128 × 128, with a step size equal to half the fragment size. These fragments were then classified using an ensemble of models, and the classification results were stored in the alpha channel of the PNG file. High confidence for the “crack” class was represented by maximum opacity, while low confidence corresponded to full transparency. Importantly, the information in the RGB channels was preserved, enabling segmentation models to adapt to the fourth channel rather than relying solely on the remaining RGB data post-classification.
No additional post-processing was applied.
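A minimal sketch of attaching the fused probability map (obtained from the overlapping fragments classified with a half-size step) as a fourth alpha channel with OpenCV is given below; the 0–255 scaling of confidence values is an assumption consistent with the opacity description above.

```python
import cv2
import numpy as np

def add_probability_alpha(image_bgr, prob_map, out_path):
    """Store the fused crack probability map in the alpha channel of a PNG,
    keeping the original RGB data intact (a sketch)."""
    alpha = (np.clip(prob_map, 0.0, 1.0) * 255).astype(np.uint8)  # 255 = high confidence, 0 = transparent
    bgra = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2BGRA)
    bgra[:, :, 3] = alpha
    cv2.imwrite(out_path, bgra)  # PNG preserves the fourth (alpha) channel
```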
2.3. Image Segmentation Models
To evaluate the effectiveness of the proposed approach, this study considered lightweight models for real-time segmentation (ENet), well-established architectures such as U-Net and SegNet, as well as relatively recent models, including DeepLabV3+, FastFCN, and HRNet. Additionally, a comparison was conducted with the state-of-the-art real-time segmentation model YOLO11x-seg.
The DeepLabV3+ model was introduced in 2018 as a modification of DeepLabV3 [31] for semantic segmentation. The primary improvement was the addition of a simple yet effective decoder to refine the segmentation results, particularly along object boundaries.
In addition, the model enhanced the Atrous Spatial Pyramid Pooling (ASPP) module, which captures multi-scale contextual information by applying atrous convolution with varying rates.
The architecture of the model used in this study is schematically presented in Figure A1.
Figure A1 shows the DeepLabV3+ architecture, illustrating the role of the encoder, ASPP module, and decoder in producing refined segmentation masks. The encoder is based on a ResNet50 [32] backbone, while the ASPP module captures features at different scales, and the decoder improves the spatial resolution for object boundary delineation.
The ENet model was introduced in 2016 for real-time semantic segmentation tasks [33]. The model operates up to 18 times faster and has 79 times fewer parameters than the state-of-the-art models of its time, while achieving comparable accuracy. A key feature of the model is its use of a large encoder and a small decoder. According to the authors, the encoder should mimic the behavior of original architectures by processing low-resolution data, while the decoder’s role is to upsample the encoder’s output and refine the details.
Another notable feature is the removal of most ReLU activations in the initial layers. Additionally, the authors proposed replacing ReLU with PReLU, which allows negative activations to have nonzero values.
The architecture of the ENet model is schematically shown in Figure A2.
Figure A2 illustrates the ENet architecture, highlighting the structure of its encoder and decoder. The encoder processes input images at reduced resolutions to efficiently extract features, while the lightweight decoder progressively restores spatial details in the segmentation output.
The FastFCN architecture was introduced in 2019 [34]. Its main innovation lies in replacing the dilated convolutions in the backbone with standard convolutions, thereby restoring the original backbone architecture (e.g., ResNet) without dilation. Instead of dilated convolutions, the authors proposed a novel approach called Joint Pyramid Upsampling (JPU).
The JPU architecture employs a feature pyramid approach, gathering features from multiple layers (e.g., layers 3, 4, and 5 in ResNet). To increase the resolution of these features, upsampling is performed using three parallel 3 × 3 convolutions with different stride levels. Multi-scale features are then combined using channel-wise concatenation. According to the original publication, this method enables the extraction of global features with lower computational costs compared with traditional methods.
Additionally, this model incorporates the Context Encoding Module (CEM) [35] to capture global context and enhance semantic understanding. This reduces dependency on local noise and improves model stability.
The architecture used in this study is schematically shown in Figure A3.
Figure A3 illustrates the FastFCN architecture, highlighting the integration of JPU for multi-scale feature extraction and CEM for global context representation. The combination of these modules enhances segmentation accuracy while maintaining computational efficiency.
The High-Resolution Network (HRNet) was introduced in 2019 [36]. The architecture is designed to maintain high-resolution representations throughout the entire processing pipeline.
The key features of HRNet include the following:
Parallel branches with varying resolutions: Unlike traditional architectures which progressively reduce the resolution, HRNet maintains multiple parallel branches with different resolutions.
Multi-level information fusion: A continuous exchange of information between branches enables the generation of more precise representations.
High accuracy and spatial precision: By preserving the resolution, HRNet ensures more accurate representations in tasks requiring precise spatial information.
The architecture used in this study is schematically shown in Figure A4.
Figure A4 illustrates the HRNet architecture, emphasizing its multi-resolution branches and the fusion process, which enables accurate and detailed spatial representations for segmentation tasks.
The SegNet architecture was introduced in 2015 [37] as an efficient neural network for semantic segmentation, targeting mobile applications and real-time computation. The primary innovation of this model lies in its encoder-decoder structure with a unique upsampling mechanism.
The key features of SegNet include the following:
Encoder-decoder architecture: The encoder consists of 13 convolutional layers, analogous to the first 13 layers of the VGG16 network, and utilizes max pooling to progressively reduce the feature dimensions. The decoder, in turn, is responsible for restoring the resolution of the segmented image.
Index-based upsampling: The core innovation of SegNet is the use of stored max-pooling indices from the encoder during upsampling in the decoder (see the sketch after this list). This allows upsampling without the need to train additional parameters, significantly reducing memory requirements and improving the model speed.
Memory and computational efficiency: Thanks to its architecture, SegNet uses fewer parameters compared with other deep segmentation networks (e.g., FCN or DeepLab), making it suitable for deployment on devices with limited computational resources.
Segmentation accuracy: Restoring the resolution via pooling indices helps preserve object details, which is crucial for semantic segmentation tasks that require pixel-level precision.
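The index-based upsampling mechanism can be illustrated with PyTorch's max-pooling and unpooling operations; the tensor sizes below are arbitrary example values.

```python
import torch
import torch.nn.functional as F

# Sketch of SegNet-style index-based upsampling: the encoder stores the
# max-pooling indices, and the decoder reuses them to place values back
# at their original spatial positions.
x = torch.randn(1, 64, 32, 32)                                   # example encoder feature map
pooled, indices = F.max_pool2d(x, kernel_size=2, stride=2, return_indices=True)
upsampled = F.max_unpool2d(pooled, indices, kernel_size=2, stride=2)
print(upsampled.shape)  # torch.Size([1, 64, 32, 32])
```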
The architecture used in this study is schematically shown in Figure A5.
Figure A5 illustrates the SegNet architecture, highlighting its encoder-decoder structure and index-based upsampling mechanism, which contribute to its efficiency and accuracy in segmentation tasks.
The U-Net model was introduced in 2015 [38] as an architecture for semantic segmentation specifically designed for processing medical images. Due to its efficiency, it has become popular for a wide range of segmentation tasks, including satellite imagery, agronomy, biological analysis, and material defect detection.
The key features of U-Net include the following:
Symmetrical encoder-decoder architecture: The encoder consists of convolutional layers, each followed by max pooling to reduce the image dimensions and extract key features. The decoder gradually increases the resolution using transposed convolutions to restore the pixel representation of segmented objects.
Skip connections: A defining feature of U-Net is the presence of skip connections between the encoder and decoder. These connections allow high-resolution information from early encoder layers to be directly passed to corresponding decoder layers, significantly improving the accuracy of detail recovery.
Preservation of object boundaries and precise localization: This strategy helps retain object contours and enables precise localization, making the model highly effective for tasks which require recognition of fine details.
High accuracy on small datasets: Due to its multi-level processing and efficient use of spatial information, U-Net delivers excellent results even on small datasets.
The architecture used in this study is schematically shown in Figure A6.
Figure A6 illustrates the U-Net architecture, emphasizing its symmetrical encoder-decoder structure and the use of skip connections to enhance segmentation accuracy and detail preservation.
The YOLOv11 model is the latest iteration in the “You Only Look Once” (YOLO) series developed by Ultralytics. It continues the tradition of providing high speed and accuracy in computer vision tasks. In the updated architecture, the C2f block from previous versions has been replaced with a C3k2 block. Additionally, the model retains the SPPF component and introduces the C2PSA module, which enhances feature extraction [39].
Other improvements in YOLOv11 [40] include the following:
Enhanced feature extraction;
Optimized speed and efficiency;
Improved accuracy with fewer parameters.
In this study, the largest version of the model, YOLO11x-seg, was utilized.
The architecture used in this study is schematically shown in Figure A7.
Figure A7 illustrates the YOLOv11 architecture, emphasizing its advanced components, including the C3k2 block and the C2PSA module, which collectively enhance its feature extraction and segmentation capabilities while maintaining efficiency.
2.5. Evaluation Metrics
All models were evaluated using consistent metrics, with additional comparisons made against the original YOLO11x-seg model. The machine learning models in this study were assessed based on the following metrics: precision, recall, F1 score, IoU, and loss. Additionally, the epoch at which the best-performing model was saved was recorded.
Precision in segmentation tasks measures the proportion of correctly predicted pixels belonging to a class out of the total number of pixels predicted for that class:
$$\text{Precision} = \frac{TP}{TP + FP},$$
where TP (true positive) is the number of pixels correctly classified as belonging to the target class and FP (false positive) is the number of pixels incorrectly classified as belonging to the target class.
Recall measures the proportion of actual pixels belonging to a class which were correctly identified by the model. This metric reflects the model’s sensitivity to detecting pixels of a given class:
$$\text{Recall} = \frac{TP}{TP + FN},$$
where FN (false negative) is the number of pixels belonging to the target class which were missed, i.e., incorrectly classified as not belonging to the class.
The F1 score is the harmonic mean of precision and recall, accounting for both false positives and false negatives:
$$F1 = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}.$$
Here, the intersection over union (IoU), or the Jaccard Index, measures the degree of overlap between the predicted mask and the ground truth mask:
$$\text{IoU} = \frac{\text{Intersection}}{\text{Union}},$$
where Intersection is the number of pixels belonging to both the predicted mask and the ground truth mask and Union is the total number of pixels belonging to at least one of the masks (either predicted or ground truth).
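These metrics can be computed directly from binary masks, as in the sketch below; it assumes NumPy boolean arrays of equal shape and is provided for illustration only.

```python
import numpy as np

def segmentation_metrics(pred_mask, gt_mask):
    """Pixel-level precision, recall, F1 score, and IoU for binary masks (sketch)."""
    pred, gt = pred_mask.astype(bool), gt_mask.astype(bool)
    tp = np.logical_and(pred, gt).sum()      # correctly predicted crack pixels
    fp = np.logical_and(pred, ~gt).sum()     # predicted crack pixels that are background
    fn = np.logical_and(~pred, gt).sum()     # crack pixels missed by the model
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    iou = tp / (tp + fp + fn) if tp + fp + fn else 0.0  # |intersection| / |union|
    return precision, recall, f1, iou
```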