3.1. Datasets
For crack detection, several datasets have been created to assess the performance of different algorithms. This paper employs three datasets, detailed as follows:
Collected in Beijing with an iPhone 5, the CFD dataset comprises 118 RGB road images (480 × 320 pixels). It incorporates noise elements such as oil stains, shadows, and water stains, and focuses solely on road surface texture and cracks while excluding unrelated objects (e.g., garbage, cars). The diverse noise and environments pose challenges for crack detection algorithms and simulate real urban road conditions.
Comprising 500 images (resolution near 2000 × 1500 pixels) taken with mobile phones on Temple University’s campus and split into 250 training, 50 validation, and 200 test images, the Crack500 dataset adapts to computational limits by dividing each image into 16 non-overlapping parts. Only parts containing more than 1000 crack pixels are retained. After meticulous pixel-level annotation, the dataset contains 3368 crack image patches.
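As a rough illustration (not the original preprocessing script), the following sketch shows how each Crack500 image could be split into a 4 × 4 grid of non-overlapping tiles and filtered by the 1000-crack-pixel threshold described above; the function name and the array-based interface are our own assumptions.

```python
import numpy as np

def tile_and_filter(image, mask, grid=(4, 4), min_crack_pixels=1000):
    """Split an image/mask pair into non-overlapping tiles and keep only the
    tiles whose ground-truth mask contains enough crack pixels (illustrative)."""
    h, w = mask.shape[:2]
    th, tw = h // grid[0], w // grid[1]
    kept = []
    for r in range(grid[0]):
        for c in range(grid[1]):
            img_tile = image[r * th:(r + 1) * th, c * tw:(c + 1) * tw]
            msk_tile = mask[r * th:(r + 1) * th, c * tw:(c + 1) * tw]
            # Retain the tile only if it contains more than 1000 crack pixels.
            if (msk_tile > 0).sum() > min_crack_pixels:
                kept.append((img_tile, msk_tile))
    return kept
```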
With 537 crack images, DeepCrack is characterized by complex backgrounds and multi-scale cracks, covering three surface textures (bare, dirty, rough) and two scenarios (concrete, asphalt). Crack widths vary from 1 to 180 pixels, and the crack area in each image is small, reflecting real conditions. All images are hand-annotated with binary segmentation masks.
These datasets offer a comprehensive evaluation of the SECrackSeg model across common real-world scenarios. Their diversity and complexity ensure thorough testing of the model’s performance.
3.2. Experimental Setup
Experiments were conducted on a Windows 10 system with an RTX 2070 Super GPU. We used Python 3.8.20 with CUDA 11.8 and cuDNN 8.9 for GPU acceleration, and implemented the model with the PyTorch deep learning framework.
To comprehensively evaluate the semantic segmentation model, key metrics—Precision, Recall, F1-Score, and Mean Intersection over Union (mIoU)—were adopted. These metrics evaluate the model from multiple perspectives, helping to understand its advantages and disadvantages in different tasks. Their definitions and formulas are as follows:
Precision
Precision measures the proportion of truly positive samples among those predicted as positive by the model. The formula is:

$$\text{Precision} = \frac{TP}{TP + FP}$$

Here, $TP$ (True Positives) denotes pixels predicted as positive and actually positive, and $FP$ (False Positives) denotes pixels predicted as positive but actually negative.
Recall
Recall evaluates the model’s ability to identify all actual positive samples, defined by:

$$\text{Recall} = \frac{TP}{TP + FN}$$

where $FN$ (False Negatives) denotes pixels predicted as negative but actually positive.
F1-Score
F1-Score, the harmonic mean of Precision and Recall, considers both prediction precision and coverage, making it suitable for imbalanced class scenarios. Its formula is:

$$F1 = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$$
Mean Intersection over Union (mIoU)
mIoU measures model performance across all classes by averaging the Intersection over Union (IoU) of each class. For the $i$-th class, the IoU is:

$$IoU_i = \frac{TP_i}{TP_i + FP_i + FN_i}$$

where $TP_i$, $FP_i$, and $FN_i$ are the True Positives, False Positives, and False Negatives for the $i$-th class. The mIoU is then:

$$mIoU = \frac{1}{C} \sum_{i=1}^{C} IoU_i$$

where $C$ is the number of classes.
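For reference, a minimal sketch of how these metrics can be computed from integer label maps is given below (this is an illustration, not the evaluation code used in the experiments); for binary crack segmentation, $C = 2$.

```python
import numpy as np

def segmentation_metrics(pred, gt, num_classes=2, eps=1e-7):
    """Compute Precision, Recall, F1 (for the crack class, label 1) and mIoU
    from integer label maps `pred` and `gt` of the same shape."""
    pred, gt = pred.astype(int), gt.astype(int)

    # Precision, Recall, F1 for the positive (crack) class.
    tp = np.sum((pred == 1) & (gt == 1))
    fp = np.sum((pred == 1) & (gt == 0))
    fn = np.sum((pred == 0) & (gt == 1))
    precision = tp / (tp + fp + eps)
    recall = tp / (tp + fn + eps)
    f1 = 2 * precision * recall / (precision + recall + eps)

    # mIoU: average the per-class IoU over all classes.
    ious = []
    for i in range(num_classes):
        tp_i = np.sum((pred == i) & (gt == i))
        fp_i = np.sum((pred == i) & (gt != i))
        fn_i = np.sum((pred != i) & (gt == i))
        ious.append(tp_i / (tp_i + fp_i + fn_i + eps))
    miou = float(np.mean(ious))

    return precision, recall, f1, miou
```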
These metrics reflect the model’s strengths and weaknesses from different angles. Using them together enables a comprehensive performance evaluation, providing a basis for model optimization and selection.
3.3. Evaluation of SECrackSeg
To demonstrate the effectiveness of our proposed model, we adopted several mainstream crack segmentation methods as baselines. Brief descriptions of these baseline algorithms are provided before the comparison.
Table 1 illustrates the performance comparison of various cutting-edge methods on the CFD dataset, a relatively small dataset. Given its limited sample quantity, the CFD dataset poses significant challenges for models aiming to achieve high performance. With insufficient data for training and validation, many models struggle to generalize well, often leading to overfitting or poor performance on unseen data. However, our proposed SECrackSeg model shows remarkable adaptability in this low-data scenario.
These methods are evaluated via four metrics: Precision (P), Recall (R), F1-Score (F1), and Mean Intersection over Union (mIoU). Results show that SECrackSeg outperforms other methods in all four metrics. Specifically, SECrackSeg achieves a Precision of 0.895, Recall of 0.938, F1-Score of 0.915, and mIoU of 0.854. This outstanding performance can be attributed to its innovative architecture, which includes components like the SAM2 S-Adapter. The S-Adapter allows the model to leverage the pre-trained knowledge of the large-scale SAM2 model while adapting to the specific characteristics of the small-sample CFD dataset. It effectively extracts multi-scale features related to cracks, enhancing the model’s ability to accurately segment cracks even with limited data.
The second-best method is DeepCrack, with a Precision of 0.809, Recall of 0.871, F1-Score of 0.837, and mIoU of 0.751. Other methods, including UNet, UNet++, DeeplabV3+, OCRNet, AttentionUnet, and CT-crackseg, display relatively lower performance across these metrics. Notably, CT-crackseg has a high Recall of 0.897 but a relatively low Precision of 0.717, indicating it effectively identifies positive samples while having a higher false positive rate.
Overall, the results highlight SECrackSeg’s superior performance in crack segmentation tasks on the CFD dataset, even with its limited sample size. This not only validates the effectiveness of the proposed model but also demonstrates its potential for applications where data is scarce, such as in-field inspections of infrastructure where collecting a large number of samples may be difficult or costly.
To visually demonstrate the performance of our method, we selected a sample image from the CFD dataset, as shown in
Figure 5. Our method shows significant improvements in delineating the crack boundaries more accurately and capturing finer details compared to other methods. This visual comparison highlights the effectiveness of our proposed method in crack segmentation tasks.
To further assess the performance of the proposed SECrackSeg model, experiments were conducted on the Crack500 dataset. As a relatively large-scale dataset, Crack500 allows for a comprehensive evaluation of various methods. Results are presented in
Table 2. As shown, SECrackSeg achieves the best overall performance, reaching a Precision of 0.895, Recall of 0.890, F1-Score of 0.892, and mIoU of 0.838. Although its Recall is slightly lower than that of CT-crackseg (0.908) and DeepCrack (0.895), suggesting room for improvement in this specific aspect, it clearly leads in Precision, F1-Score, and mIoU and dominates in overall performance.
As shown in Table 3, on the DeepCrack dataset SECrackSeg again delivers the best overall performance, obtaining the highest Precision, F1-Score, and mIoU among the compared methods. As on Crack500, its Recall is slightly lower than that of CT-crackseg and DeepCrack, yet it still excels in overall performance. The DeepCrack dataset, with its variety of crack types and scenarios, enables a comprehensive performance evaluation of the model. The results show that SECrackSeg handles different crack types well, making it a reliable choice for crack segmentation tasks.
3.3.1. Ablation Study on the Middle Dimension of S-Adapter
To evaluate the impact of the internal bottleneck dimension in the SAM2 S-Adapter, we conducted an ablation study by varying the middle dimension $d_{\mathrm{mid}}$ across different settings (16 and 64), while keeping the input and output dimensions fixed according to the SAM2 Hiera encoder.
In the S-Adapter module, the dimensional configurations of the linear layers play a crucial role. The linear down-projection layer, denoted as $W_{\mathrm{down}}$, compresses the input features from dimension $d_{\mathrm{in}}$ to the middle dimension $d_{\mathrm{mid}}$. This reduces feature complexity and computational cost, allowing the model to focus on the key information. The linear up-projection layer, $W_{\mathrm{up}}$, then restores the compressed and transformed features to dimension $d_{\mathrm{out}}$, which is typically set equal to $d_{\mathrm{in}}$ to ensure compatibility with subsequent network layers.
Here:
$d_{\mathrm{in}}$ and $d_{\mathrm{out}}$ are fixed and determined by the corresponding block of the Hiera encoder. These values are set to the number of channels at each feature scale, depending on the encoder stage at which the adapter is inserted.
$d_{\mathrm{mid}}$ is the only tunable hyperparameter. It represents the projection dimension within the adapter and controls its transformation capacity. A larger $d_{\mathrm{mid}}$ means that the linear down-projection and up-projection layers have more parameters, which can potentially introduce more expressive power. However, this may also increase the risk of overfitting and the training cost, as more complex transformations might lead the model to memorize the training data rather than learning general patterns.
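A minimal, hypothetical sketch of this bottleneck structure is given below; the class name, the residual connection, and the GELU activation are our own assumptions, while $d_{\mathrm{in}}$/$d_{\mathrm{out}}$ follow the Hiera stage widths and $d_{\mathrm{mid}}$ is the tunable middle dimension (64 in the final model).

```python
import torch
import torch.nn as nn

class BottleneckAdapter(nn.Module):
    """Illustrative adapter: down-project, non-linearity, up-project,
    added residually to the frozen encoder features."""
    def __init__(self, d_in, d_mid=64, d_out=None):
        super().__init__()
        d_out = d_in if d_out is None else d_out   # d_out is typically equal to d_in
        self.down = nn.Linear(d_in, d_mid)         # W_down: d_in -> d_mid
        self.act = nn.GELU()
        self.up = nn.Linear(d_mid, d_out)          # W_up:   d_mid -> d_out

    def forward(self, x):
        # x: (..., d_in) token features from a Hiera encoder block
        return x + self.up(self.act(self.down(x)))

# Example: an adapter inserted at a 96-channel encoder stage (hypothetical width)
adapter = BottleneckAdapter(d_in=96, d_mid=64)
tokens = torch.randn(1, 4096, 96)
out = adapter(tokens)   # shape preserved: (1, 4096, 96)
```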
The experimental results are summarized in Figure 6, showing how different values of $d_{\mathrm{mid}}$ affect the performance of SECrackSeg on the CFD dataset:
Precision: Increases as $d_{\mathrm{mid}}$ increases, reaching 0.895 at 64 dimensions. This increase shows that the linear layers in the S-Adapter can capture more complex feature relationships.
Recall: Peaks at 0.938 with 64 dimensions. When $d_{\mathrm{mid}}$ changes from 16 to 64, Recall improves, but a further increase might lead to overfitting, as suggested by a potential drop at higher values.
F1-Score: Reaches a maximum of 0.915 at 64, indicating a good balance in feature transformation at this dimension.
mIoU: Also reaches its highest value of 0.854 at 64 dimensions, showing the model’s ability to segment cracks accurately across all classes.
Figure 6. Impact of SAM2 S-Adapter middle dimension on SECrackSeg performance on the CFD dataset.
These results demonstrate that while increasing the internal bottleneck dimension can enhance the learning of complex feature transformations, it may also reduce generalization, especially on small datasets. The 64-dimensional setting provides a good trade-off between Precision and Recall, enabling accurate and robust crack segmentation.
In conclusion, $d_{\mathrm{mid}} = 64$ is selected as the optimal configuration in the final model. This setting ensures that the adapter modules’ linear layers can effectively adapt pretrained SAM2 features while maintaining high generalization in data-scarce conditions.
3.3.2. Ablation Study on the MSDC Module
As shown in
Figure 7, ablation experiments were conducted on the CFD dataset with different configurations of the MSDC module. The results are summarized in
Table 4.
The MSDC module significantly enhances model performance, particularly in multi-scale feature extraction and segmentation accuracy. In detail, Precision increases steadily from 0.832 for MSDC(a) to 0.895 for MSDC(e), indicating that the model becomes more accurate at identifying truly positive samples as the number of convolutional layers and dilation rates increases. Recall also rises, from 0.775 for MSDC(a) to 0.938 for MSDC(e), meaning the model recognizes more of the actual positive samples. The F1-Score and mIoU follow similar patterns, with continuous improvements.
The substantial improvements from MSDC(a) to MSDC(d) clearly demonstrate the module’s effectiveness in capturing features at different scales, which is crucial for accurate segmentation. The slight improvement from MSDC(d) to MSDC(e) indicates that the module’s design is highly optimized, and additional layers may not significantly boost performance.
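As a hedged sketch of the underlying idea (the dilation rates, layer arrangement, and class name are illustrative assumptions, not the exact SECrackSeg configuration), a multi-scale dilated convolution block can run parallel 3 × 3 branches with increasing dilation rates and fuse them with a 1 × 1 convolution:

```python
import torch
import torch.nn as nn

class MultiScaleDilatedConv(nn.Module):
    """Illustrative multi-scale dilated convolution block: parallel 3x3 branches
    with different dilation rates, concatenated and fused by a 1x1 convolution."""
    def __init__(self, in_ch, out_ch, dilations=(1, 2, 4, 8)):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Sequential(
                nn.Conv2d(in_ch, out_ch, 3, padding=d, dilation=d, bias=False),
                nn.BatchNorm2d(out_ch),
                nn.ReLU(inplace=True),
            )
            for d in dilations
        ])
        self.fuse = nn.Conv2d(out_ch * len(dilations), out_ch, kernel_size=1)

    def forward(self, x):
        feats = [branch(x) for branch in self.branches]   # one feature map per scale
        return self.fuse(torch.cat(feats, dim=1))

# Example usage on a 64-channel feature map
msdc = MultiScaleDilatedConv(64, 64)
y = msdc(torch.randn(1, 64, 80, 120))   # -> (1, 64, 80, 120)
```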
In practical applications such as structural health monitoring, cracks of different sizes pose varying threats to the integrity of structures. The MSDC module’s ability to accurately segment cracks of different scales is of great significance. It enables the detection of both small and large cracks, helping engineers to identify potential safety hazards in a timely manner and make appropriate maintenance decisions. Overall, the results show that the MSDC module effectively improves the model’s ability to capture multi-scale features, thereby enhancing segmentation accuracy on the CFD dataset.
3.3.3. Ablation Study on the MI-Upsampling Module
To comprehensively evaluate the impact of the MI-Upsampling module in the SECrackSeg model, we conducted ablation experiments comparing it with traditional upsampling methods, including bilinear interpolation and transposed convolution. The results are shown in
Table 5.
As shown in
Table 5, the MI-Upsampling module achieves the best performance across all evaluation metrics—Precision, Recall, F1-Score, and mIoU. While the absolute improvements over bilinear interpolation and transposed convolution may appear relatively small, these gains are consistent and indicate meaningful enhancements in edge prediction and spatial detail preservation.
Notably, the Recall improves from 0.912 (bilinear) and 0.925 (transposed convolution) to 0.938 with MI-Upsampling, reflecting better crack completeness. Meanwhile, the F1-Score rises to 0.915 and mIoU to 0.854, confirming improved balance between Precision and Recall.
Beyond numerical performance, MI-Upsampling is explicitly designed to address two critical limitations of traditional upsampling: (1) information loss and (2) the generation of pseudo-edges. It integrates slice-based, bilinear, and deconvolution strategies in a hybrid fusion scheme to enhance structural details while minimizing artifacts.
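A rough sketch of such a hybrid fusion is given below; it is our own approximation under the assumption that the slice-based path behaves like a pixel-shuffle rearrangement, and the actual MI-Upsampling module may combine the three strategies differently.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class HybridUpsample(nn.Module):
    """Illustrative hybrid 2x upsampling: a slice/pixel-shuffle path, a bilinear
    path, and a transposed-convolution path fused by a 1x1 convolution."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.pre_shuffle = nn.Conv2d(in_ch, out_ch * 4, kernel_size=1)  # feeds PixelShuffle(2)
        self.shuffle = nn.PixelShuffle(2)
        self.bilinear_proj = nn.Conv2d(in_ch, out_ch, kernel_size=1)
        self.deconv = nn.ConvTranspose2d(in_ch, out_ch, kernel_size=2, stride=2)
        self.fuse = nn.Conv2d(out_ch * 3, out_ch, kernel_size=1)

    def forward(self, x):
        p1 = self.shuffle(self.pre_shuffle(x))                     # slice-style path
        p2 = self.bilinear_proj(F.interpolate(x, scale_factor=2,
                                              mode="bilinear", align_corners=False))
        p3 = self.deconv(x)                                        # learned deconvolution path
        return self.fuse(torch.cat([p1, p2, p3], dim=1))

up = HybridUpsample(128, 64)
y = up(torch.randn(1, 128, 40, 60))   # -> (1, 64, 80, 120)
```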
To better illustrate these improvements, we provide a visual comparison in
Figure 8. As shown, MI-Upsampling produces cleaner crack boundaries and suppresses pseudo-edges (highlighted in red), which are frequently present in outputs from the other two methods.
In summary, although MI-Upsampling yields moderate metric improvements, its architectural advantages—enhanced edge fidelity, detail preservation, and suppression of pseudo-edges—make it a vital component of the SECrackSeg pipeline. These enhancements contribute meaningfully to the model’s robustness and segmentation quality, particularly in challenging real-world crack scenarios.
3.3.4. Ablation Study on the Edge-Aware Attention Module
As shown in
Table 6, ablation experiments were conducted on the CFD dataset with different configurations of the Edge-Aware Attention Module:
Method 1: Output from the last decoder without any Edge-Aware Attention Modules.
Method 2: One Edge-Aware Attention Module connected to the last decoder.
Method 3: Two Edge-Aware Attention Modules connected to the last two decoders.
Method 4: Three Edge-Aware Attention Modules connected to the last three decoders.
The results in
Table 6 demonstrate the impact of using different numbers of Edge-Aware Attention Modules on the model’s performance. Method 2, with a single Edge-Aware Attention Module, shows a noticeable improvement in all metrics over Method 1, which does not use the module. Method 3, which uses two modules, further enhances performance, indicating that capturing edge details at multiple decoder stages is beneficial. Method 4, with three modules, achieves the highest scores across all metrics, highlighting the effectiveness of the Edge-Aware Attention Module in improving segmentation accuracy, especially for fine-grained edge details.
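For illustration only, the sketch below uses a deliberately simplified, hypothetical EdgeAwareAttention (a fixed Laplacian edge prior gating the features), which is not the module defined in this paper, to show how one such block can be attached to each of the last N decoder stages as in Methods 2 to 4; the decoder channel widths are assumed values.

```python
import torch
import torch.nn as nn

class EdgeAwareAttention(nn.Module):
    """Simplified illustration: derive an edge response with a fixed Laplacian
    kernel and use it to re-weight decoder features (not the paper's exact design)."""
    def __init__(self, channels):
        super().__init__()
        lap = torch.tensor([[0., 1., 0.], [1., -4., 1.], [0., 1., 0.]])
        self.register_buffer("lap", lap.view(1, 1, 3, 3).repeat(channels, 1, 1, 1))
        self.gate = nn.Sequential(nn.Conv2d(channels, channels, 1), nn.Sigmoid())
        self.channels = channels

    def forward(self, x):
        edges = nn.functional.conv2d(x, self.lap, padding=1, groups=self.channels)
        return x * self.gate(edges) + x   # edge-guided re-weighting with a residual path

# Attaching one module to each of the last N decoder stages (Methods 2-4):
decoder_channels = [256, 128, 64]   # hypothetical channel widths of the last decoders
n_modules = 3                       # Method 4 attaches modules to the last three decoders
edge_blocks = nn.ModuleList(
    [EdgeAwareAttention(c) for c in decoder_channels[-n_modules:]]
)
```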
3.3.5. Ablation Study on Loss Function
As shown in
Table 7, ablation experiments were conducted on the CFD dataset with different configurations of the loss function.
The model’s loss function combines a weighted binary cross-entropy loss ($\mathcal{L}_{\mathrm{wBCE}}$) and a weighted IoU loss ($\mathcal{L}_{\mathrm{wIoU}}$) with multi-granularity supervision. Comparing the standard binary cross-entropy loss ($\mathcal{L}_{\mathrm{BCE}}$), the standard IoU loss ($\mathcal{L}_{\mathrm{IoU}}$), the proposed weighted combination ($\mathcal{L}_{\mathrm{wBCE}} + \mathcal{L}_{\mathrm{wIoU}}$), and the total loss with multi-granularity supervision ($\mathcal{L}_{\mathrm{total}}$), the results show that the standard losses yield lower performance. The weighted combination improves performance by emphasizing difficult pixels, and the total loss with multi-granularity supervision achieves the best results, with the highest Precision, Recall, F1-Score, and mIoU values. This demonstrates the effectiveness of the designed loss function in enhancing crack segmentation accuracy.
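A sketch of one common way to realize such a weighted combination is shown below, assuming the boundary-emphasizing weighting scheme popularized by structure-loss formulations; the exact weights, pooling kernel size, and multi-granularity aggregation used in SECrackSeg may differ.

```python
import torch
import torch.nn.functional as F

def weighted_bce_iou_loss(logits, gt):
    """Illustrative weighted BCE + weighted IoU loss.
    logits, gt: float tensors of shape (N, 1, H, W); gt values in {0, 1}.
    Pixels whose local label average differs from the label itself (i.e. near
    boundaries or hard regions) receive larger weights."""
    weight = 1 + 5 * torch.abs(
        F.avg_pool2d(gt, kernel_size=31, stride=1, padding=15) - gt
    )

    # Weighted binary cross-entropy.
    wbce = F.binary_cross_entropy_with_logits(logits, gt, reduction="none")
    wbce = (weight * wbce).sum(dim=(2, 3)) / weight.sum(dim=(2, 3))

    # Weighted IoU loss.
    pred = torch.sigmoid(logits)
    inter = ((pred * gt) * weight).sum(dim=(2, 3))
    union = ((pred + gt) * weight).sum(dim=(2, 3))
    wiou = 1 - (inter + 1) / (union - inter + 1)

    return (wbce + wiou).mean()

# Multi-granularity supervision (illustrative): sum the loss over all side outputs.
# total = sum(weighted_bce_iou_loss(out, gt) for out in side_outputs)
```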
3.3.6. Ablation Study from Baseline to SECrackSeg
As shown in
Table 8, the baseline U-Net model achieves a Precision of 0.855, Recall of 0.716, F1-Score of 0.779, and mIoU of 0.680 on the CFD dataset. The integration of the SAM2 S-Adapter significantly enhances performance across all metrics: Precision improves to 0.872 (+1.7%), Recall to 0.803 (+8.7%), F1-Score to 0.836 (+5.7%), and mIoU to 0.820 (+14.0%). This significant enhancement in the small-sample scenario indicates that the model can better extract features from limited data.
Subsequently, the MSDC module demonstrates substantial improvements in multi-scale feature fusion: Precision increases to 0.887 (+1.5%), Recall jumps to 0.860 (+5.7%), F1-Score reaches 0.873 (+3.7%), and mIoU rises to 0.838 (+1.8%). These gains highlight its critical role in capturing diverse crack scales through dilated convolutions.
The MI-Upsampling module then provides modest refinement: Precision slightly improves to 0.890 (+0.3%), Recall enhances to 0.875 (+1.5%), F1-Score achieves 0.882 (+0.9%), and mIoU stabilizes at 0.842 (+0.4%), validating its role in preserving structural details.
Finally, the Edge-Aware Attention mechanism elevates performance to peak levels: Precision peaks at 0.895 (+0.5%), Recall surges to 0.938 (+6.3%), F1-Score culminates at 0.915 (+3.3%), and mIoU reaches 0.854 (+1.2%), demonstrating exceptional edge segmentation capability.