1. Introduction
With the acceleration of global industrialization, cast components—valued for their high strength, dimensional accuracy, excellent wear resistance, and relatively low production cost—have become indispensable in mission-critical sectors including aerospace, automotive manufacturing, rail transit, and heavy industrial equipment [1]. Yet, during casting, molten metal is highly susceptible to process-induced disturbances such as gas entrapment, nonmetallic inclusions, and thermally induced solidification heterogeneity, which frequently give rise to internal defects including pores and inclusions. Left undetected, these defects can precipitate catastrophic structural failures, significantly degrade mechanical integrity (e.g., fatigue life and tensile strength), compromise operational safety, and incur substantial economic losses across the product lifecycle [2]. Consequently, automated, high-accuracy detection of internal casting defects has emerged as a pivotal quality-assurance step for ensuring both component reliability and system-level operational resilience.
Currently, industry-standard non-destructive testing (NDT) techniques include radiographic testing (RT) [3], ultrasonic testing (UT) [4], and magnetic particle testing (MT) [5]. Among these, X-ray-based inspection has gained prominence as a mainstream industrial modality owing to its unique capacity to penetrate geometrically complex castings, resolve sub-surface anomalies with high contrast sensitivity, and integrate seamlessly with digital image analysis pipelines. Digital radiography (DR), a modern implementation of X-ray RT [6], delivers intuitive, high-fidelity visualization of defect type, morphology, and spatial distribution—making it one of the most widely deployed NDT methods in foundry quality control. Nevertheless, conventional DR workflows remain heavily dependent on manual interpretation by expert inspectors—a practice that is inherently time-consuming, labor-intensive, and vulnerable to inter-operator variability and cognitive bias. This subjectivity undermines detection repeatability and throughput scalability, thereby impeding the adoption of DR in high-volume, precision-driven industrial inspection environments [7].
To overcome the inherent limitations of manual DR image interpretation—including low throughput, inter-inspector inconsistency, and fatigue-induced errors—automated defect detection methods have been increasingly adopted. Early solutions predominantly relied on conventional image processing techniques. While such methods can achieve moderate accuracy in controlled settings, they are fundamentally constrained by their dependence on hand-crafted features (e.g., texture histograms, edge gradients, or morphological descriptors). Consequently, even minor variations in defect morphology, orientation, or background context necessitate labor-intensive feature engineering or parameter re-optimization, resulting in poor robustness to real-world imaging variability and severely limited generalizability across diverse casting geometries and alloy systems. Moreover, their multi-stage, rule-based pipelines lack end-to-end adaptability and are typically restricted to idealized, low-complexity inspection scenarios [8,9].
With the rapid advancement of deep learning, convolutional neural network (CNN)-based object detection frameworks [10] have emerged as a dominant paradigm in industrial vision research and deployment [11]. Unlike conventional image processing methods—which require explicit, domain-specific feature engineering—deep learning models automatically learn hierarchical, high-level semantic representations directly from raw pixel data via multi-layer nonlinear transformations. This end-to-end feature abstraction significantly enhances detection robustness across heterogeneous backgrounds, variable defect scales, and morphologically diverse casting anomalies. Early deep learning detectors were predominantly two-stage architectures, exemplified by Region-based CNN (R-CNN) [12] and its accelerated variant Faster R-CNN [13]. While achieving state-of-the-art accuracy on benchmark datasets, these models incur substantial computational overhead, exhibit slow inference, and lack real-time capability—rendering them impractical for high-throughput, inline industrial inspection systems with strict cycle-time constraints. To address this bottleneck, Redmon et al. introduced the pioneering one-stage detector You Only Look Once (YOLOv1) in 2016 [14], which unified detection into a single regression task and achieved near real-time throughput (≥30 FPS) with competitive accuracy—thereby establishing a viable foundation for production-grade deployment. Subsequent generations—including YOLOv5 [15,16], YOLOv7 [17,18], and YOLOv8 [19,20]—have systematically refined architectural design, optimization strategies, and loss formulations, consolidating the YOLO family as the de facto standard for industrial defect detection owing to its combination of end-to-end trainability, low-latency inference, precise localization, and hardware-friendly deployment efficiency. More recent iterations, such as YOLOv11 [21,22] and YOLOv12 [23], further advance this trade-off through adaptive feature pyramid designs, enhanced neck modules, and improved training schedulers, yielding a superior accuracy–efficiency balance, particularly under challenging conditions such as complex background textures and concurrent multi-defect occurrences. Concurrently, Transformer-based alternatives have gained traction: the Real-Time Detection Transformer (RT-DETR) [24,25], an optimized derivative of DETR [26], replaces hand-crafted anchor mechanisms with global self-attention modeling and eliminates non-maximum suppression (NMS), enabling end-to-end detection with both higher accuracy and markedly reduced inference latency—especially advantageous for large-scale defects and geometrically intricate casting structures. In parallel, a growing body of work focuses on targeted architectural adaptation rather than wholesale model replacement. For instance, Andriosopoulou et al. [27] fused Faster R-CNN’s region-proposal strength with YOLOv5’s speed via ensemble transfer learning to achieve accurate, real-time surface defect detection in high-pressure die casting. Cheng et al. [28] targeted internal weld defect detection and improved the YOLOv5s architecture by replacing the original convolution modules and introducing the DIoU and CIoU loss functions, which enhanced both detection accuracy and speed while reducing the parameter count, enabling efficient real-time detection. Wu et al. [29] addressed the low small-target detection accuracy and large parameter count of YOLOv7-Tiny on aluminum alloy weld DR images by introducing SPD-Conv layers and the SimAM attention mechanism to strengthen shallow feature representation, thereby significantly improving small-defect recognition. Chu et al. [30] further improved the YOLOv8 algorithm by integrating dilated convolutions to enlarge the receptive field, combining the SimAM attention mechanism to enhance feature focusing, and adding an additional small-object detection head, ultimately achieving high-accuracy, high-efficiency detection of casting defects.
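For readers unfamiliar with the IoU-based regression losses mentioned above, the standard DIoU and CIoU penalties can be sketched in a few lines of pure Python. This is an illustrative re-implementation of the published loss definitions (boxes given as (x1, y1, x2, y2) corner coordinates), not code from any of the cited works:

```python
import math

def iou_terms(b1, b2):
    """Return IoU, squared center distance, and squared enclosing-box diagonal."""
    ix1, iy1 = max(b1[0], b2[0]), max(b1[1], b2[1])
    ix2, iy2 = min(b1[2], b2[2]), min(b1[3], b2[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    a1 = (b1[2] - b1[0]) * (b1[3] - b1[1])
    a2 = (b2[2] - b2[0]) * (b2[3] - b2[1])
    iou = inter / (a1 + a2 - inter + 1e-9)
    # squared distance between box centers
    rho2 = ((b1[0] + b1[2]) - (b2[0] + b2[2])) ** 2 / 4 \
         + ((b1[1] + b1[3]) - (b2[1] + b2[3])) ** 2 / 4
    # squared diagonal of the smallest box enclosing both
    cw = max(b1[2], b2[2]) - min(b1[0], b2[0])
    ch = max(b1[3], b2[3]) - min(b1[1], b2[1])
    return iou, rho2, cw * cw + ch * ch + 1e-9

def diou_loss(b1, b2):
    # DIoU: IoU term plus a normalized center-distance penalty
    iou, rho2, c2 = iou_terms(b1, b2)
    return 1.0 - iou + rho2 / c2

def ciou_loss(b1, b2):
    # CIoU: DIoU plus an aspect-ratio consistency term (assumes h > 0)
    iou, rho2, c2 = iou_terms(b1, b2)
    w1, h1 = b1[2] - b1[0], b1[3] - b1[1]
    w2, h2 = b2[2] - b2[0], b2[3] - b2[1]
    v = (4 / math.pi ** 2) * (math.atan(w2 / h2) - math.atan(w1 / h1)) ** 2
    alpha = v / (1.0 - iou + v + 1e-9)
    return 1.0 - iou + rho2 / c2 + alpha * v
```

Unlike plain IoU, both losses remain informative for non-overlapping boxes (the center-distance term still provides a gradient), which is why they are favored for small, sparsely overlapping defect targets.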
Although existing object detection algorithms have achieved considerable success in industrial defect inspection, they still face substantial challenges when detecting casting defects—such as pores, inclusions, and looseness—in DR images. These defects are typically small in size, highly variable in morphology, and low in contrast against the background, causing the high-frequency edge and texture details of small targets to attenuate easily during feature extraction and downsampling. As a result, tiny-defect features are difficult to capture effectively, and because the grayscale values of defect regions are often close to those of the surrounding matrix, localization accuracy degrades. Meanwhile, under complex metallurgical textures and multi-defect co-occurrence, insufficient modeling of channel–spatial dependencies can lead to false positives and localization drift. In addition, feature fusion often suffers from scale mismatch and inadequate information allocation, limiting multi-scale feature representation and weakening the model’s adaptability to defects at different scales, thereby impairing overall detection performance.
To address these issues, we propose an edge–discrepancy collaborative defect detection method for casting DR images, namely MTS-YOLOv11: (1) To tackle the problem of “weak edge details of small defects that easily decay with downsampling,” we introduce a Multi-Scale Edge Information Enhancement System (MSEES) into the backbone to amplify high-frequency edge and texture responses and improve the perceptibility of tiny defects; (2) to address “insufficient dependency modeling under complex backgrounds and multi-defect coexistence,” we embed TripletAttention in high-level backbone stages to jointly optimize channel and spatial dependencies, enhancing defect responses while suppressing background interference; (3) to mitigate “scale mismatch and background information leakage during shallow–deep fusion,” we design a Scale-Discrepancy-Aware Gated Fusion module (SDAGFusion) before the detection head to explicitly coordinate complementary multi-scale features, enabling high-accuracy and robust detection of pores, inclusions, and looseness. Importantly, MTS-YOLOv11 is not a simple plug-in combination of generic attention modules or multi-scale fusion blocks. Instead, it is built around the key bottlenecks observed in casting DR defect detection (weak edge cues, pronounced scale discrepancy, and strong interference from complex background textures) and thus adopts an edge–discrepancy collaborative design. Specifically, MSEES reinforces faint edge and texture details at the early feature-extraction stage to improve the perceptibility of tiny defects; TripletAttention suppresses non-defect activations induced by complex backgrounds in high-level features, reducing false positives and alleviating localization drift; and SDAGFusion dynamically allocates and fuses cross-scale information through a discrepancy-aware gating mechanism, forming a coherent “edge enhancement–dependency calibration–discrepancy-aware fusion” pipeline.
This synergistic combination distinguishes our method from prior YOLO variants that merely introduce attention blocks or adopt generic multi-scale fusion strategies. The principal contributions of this work are as follows:
1. We propose MTS-YOLOv11—a dedicated detection architecture for casting DR imagery—designed to jointly address the core challenges of small-scale defect localization, low-contrast discrimination, and background heterogeneity. Structural modifications are systematically integrated across the backbone, neck, and head to enhance feature fidelity, contextual coherence, and detection sensitivity, while keeping the computational overhead within a practical range for industrial deployment.
2. A Multi-Scale Edge Information Enhancement System (MSEES) is embedded within the C3K2 module of the backbone. By amplifying high-frequency edge gradients and textural discontinuities in shallow-layer feature maps, MSEES enhances edge/detail cues for small-scale defects, thereby improving early-stage feature perceptibility and reducing fine-detail loss during subsequent downsampling.
3. A TripletAttention mechanism is deployed in the high-level backbone stages to better model channel–spatial dependencies under complex metallurgical textures. By calibrating feature responses along channel–height, channel–width, and spatial dimensions, TripletAttention mainly helps suppress background-induced activations and reduce false positives, thereby improving detection robustness and alleviating localization drift, particularly for inclusion defects that are easily confused with noise patterns.
4. A Scale-Discrepancy-Aware Gated Fusion (SDAGFusion) module is inserted immediately before the detection head to explicitly address scale mismatch during shallow–deep feature fusion. Through a discrepancy-aware gating strategy, SDAGFusion adaptively allocates complementary multi-scale information while suppressing redundant background features, leading to more coherent cross-scale representations and enhanced sensitivity to small, low-contrast defects.
The remainder of this paper is organized as follows.
Section 2 details the MTS-YOLOv11 framework, beginning with a concise overview of the baseline YOLOv11 architecture and followed by principled descriptions of the three novel modules—MSEES, TripletAttention, and SDAGFusion—including their architectural rationale, integration points, and functional objectives.
Section 3 presents comprehensive experimental validation: quantitative comparisons against eleven state-of-the-art detectors on a standardized casting DR benchmark, ablation studies verifying module efficacy, and qualitative visualizations demonstrating enhanced localization precision and background rejection capability.
Section 4 concludes the work, summarizes key technical insights, and discusses practical implications for industrial inline inspection systems.
4. Conclusions
To enhance the detection performance of small-scale, low-contrast defects such as pores, inclusions, and looseness in casting DR images under complex metallurgical backgrounds, this paper proposes MTS-YOLOv11, an edge–discrepancy collaborative detection framework built upon YOLOv11. The method introduces a Multi-Scale Edge Information Enhancement System (MSEES) in shallow backbone stages, a TripletAttention mechanism in high-level features, and a Scale-Discrepancy-Aware Gated Fusion (SDAGFusion) module before the detection head. These three components jointly optimize feature representation from the perspectives of “edge detail enhancement,” “channel–spatial dependency calibration,” and “scale-discrepancy-aware adaptive fusion.” The effectiveness and engineering feasibility of the proposed approach are systematically validated through ablation studies, cross-dataset generalization experiments, inference efficiency analysis, and multi-seed repeatability tests. The main conclusions are as follows:
1. The proposed MSEES module operates in the shallow backbone to amplify geometrically salient edge gradients and textural discontinuities, thereby improving sensitivity to sub-pixel defect boundaries and particularly benefiting the detection of dense pores and fine inclusions. The TripletAttention mechanism, embedded in high-level backbone stages, jointly recalibrates feature responses along three dimensions—channel–height, channel–width, and native spatial space—selectively enhancing activations at true defect locations while suppressing spurious responses induced by grain-boundary noise and complex background textures. This effect is especially pronounced for inclusions, which are morphologically irregular and prone to confusion with the background. The SDAGFusion module is placed immediately before the detection head and explicitly models the differences in spatial resolution and semantic abstraction between shallow and deep features. By performing pixel-wise adaptive reweighting between “detail features” and “semantic features,” SDAGFusion mitigates conflicts across hierarchical representations, suppresses background redundancy, and produces more spatially coherent and discriminative fused features, providing more reliable inputs for subsequent classification and localization heads. Together, these three modules form a tightly coupled “edge enhancement–dependency calibration–discrepancy-aware fusion” pipeline, realizing the proposed edge–discrepancy collaborative design rather than a simple stacking of generic attention or multi-scale fusion blocks.
2. On the casting DR dataset, MTS-YOLOv11 achieves mAP@0.5 = 96.5% and mAP@0.5:0.95 = 68.5%, representing stable improvements of approximately 1.3 and 1.2 percentage points over the baseline YOLOv11. In multi-seed repeatability experiments, MTS-YOLOv11 maintains an mAP@0.5 level of about 96.5 ± 0.2, and its worst result still exceeds the best result of YOLOv11, indicating that the performance gain is repeatable rather than an artifact of a single run. Meanwhile, the model remains compact, with only 2.72 M parameters and 7.8 GFLOPs, and achieves an inference speed of 359.07 FPS on the same platform (versus 346.86 FPS for YOLOv11), reflecting a favorable balance among lightweight design, real-time performance, and accuracy improvement.
3. On a newly collected casting DR dataset, MTS-YOLOv11 attains mAP@0.5 = 91.3% and mAP@0.5:0.95 = 65.4%, and outperforms multiple comparative methods across all three defect types (pore, inclusion, and looseness). This demonstrates that the proposed edge–discrepancy collaborative design maintains strong cross-scenario adaptability and engineering applicability when workpiece structure, X-ray projection geometry, and background texture vary. Despite these improvements, failure-case analysis shows that the model still has limitations in handling extremely dense small-target clusters and defects that severely overlap with strong structural edges. Future work may incorporate instance-level segmentation priors, stronger global context modeling, and uncertainty-aware post-processing to further enhance robustness in complex industrial inspection environments.
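The pixel-wise adaptive reweighting between “detail features” and “semantic features” described in conclusion 1 can be illustrated, in highly simplified form, as a sigmoid gate driven by the local discrepancy between a shallow feature map and its upsampled deep counterpart. The scalar weight `w_gate` below stands in for SDAGFusion’s learned gating parameters and is purely hypothetical; this sketch shows the gating principle only, not the module itself:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_fusion(shallow, deep_upsampled, w_gate):
    """Discrepancy-driven pixel-wise fusion of same-resolution feature maps.

    shallow:        high-resolution "detail" features
    deep_upsampled: deep "semantic" features, already upsampled to match
    w_gate:         hypothetical learned scalar controlling gate sharpness
    """
    discrepancy = shallow - deep_upsampled
    g = sigmoid(w_gate * discrepancy)           # g in (0, 1), per pixel
    # Each location keeps a convex mix of detail and semantics.
    return g * shallow + (1.0 - g) * deep_upsampled
```

With `w_gate = 0` the gate is uniform (plain averaging); as `w_gate` grows, the fusion sharpens toward whichever branch responds more strongly at each pixel, which is the behavior a discrepancy-aware gate exploits to keep defect detail while discarding redundant background.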