YOLO-MAD: Multi-Scale Geometric Structure Feature Extraction and Fusion for Steel Surface Defect Detection

Ding, Hantao; Chen, Junkai; Ye, Hairong; Chen, Yanbing

doi:10.3390/app15147887

School of Computer Science and Technology, Zhejiang Sci-Tech University, Hangzhou 310018, China

* Author to whom correspondence should be addressed.

Appl. Sci. 2025, 15(14), 7887; https://doi.org/10.3390/app15147887

Submission received: 31 May 2025 / Revised: 1 July 2025 / Accepted: 5 July 2025 / Published: 15 July 2025

Abstract

Lightweight visual models are crucial for industrial defect detection tasks. Traditional methods and even some lightweight detectors often struggle with the trade-off between high computational demands and insufficient accuracy. To overcome these issues, this study introduces YOLO-MAD, an innovative model optimized through a multi-scale geometric structure feature extraction and fusion scheme. YOLO-MAD integrates three key modules: AKConv for robust geometric feature extraction, BiFPN to facilitate effective multi-scale feature integration, and Detect_DyHead for dynamic optimization of detection capabilities. Empirical evaluations demonstrate significant performance improvements: YOLO-MAD achieves a 5.4% mAP increase on the NEU-DET dataset and a 4.8% mAP increase on the GC10-DET dataset. Crucially, this is achieved under a moderate computational load (9.4 GFLOPs), outperforming several prominent lightweight models in detection accuracy while maintaining comparable efficiency. The model also shows enhanced recognition performance for most defect categories. This work presents a pioneering approach that balances lightweight design with high detection performance by efficiently leveraging multi-scale geometric feature extraction and fusion, offering a new paradigm for industrial defect detection.

Keywords: geometric structure features; multi-scale feature fusion; industrial inferior product detection; object detection

1. Introduction

With the accelerating pace of industrialization and the continuous advancement of infrastructure construction, industrial materials such as steel and aluminum are increasingly being applied in fields including aerospace, energy, and automotive manufacturing. Ensuring product reliability and safety has made surface defect detection a critical focus of quality management [1]. Traditional defect detection methods rely heavily on human expertise, image processing algorithms, or hand-crafted feature extractors [2,3]. Although these approaches can yield acceptable results for regular defects under controlled conditions, they generally suffer from inconsistent detection accuracy, particularly for subtle or complex defect morphologies (e.g., micro-cracks, non-uniform corrosion patterns) [4], and limited adaptability to complex industrial backgrounds characterized by noise, texture variations, or overlapping structures [5].

Deep learning-based object detection [6] has evolved rapidly and demonstrated significant advantages in industrial defect inspection owing to its powerful feature extraction and pattern recognition capabilities. Among these methods, the YOLO architecture [7] has gained prominence in industrial applications due to its real-time processing and end-to-end design. However, conventional YOLO models still face several challenges with surface defects in real-world steel inspection scenarios: structurally irregular or slender defects are difficult to recognize, which significantly reduces detection accuracy; relatively static feature fusion mechanisms fail to capture salient differences across multi-scale defects; and the shallow shared convolutional structure of the original detection head lacks task awareness, leading to mutual interference between classification and localization [8].

To overcome the constraints of YOLOv8n in steel surface defect detection, particularly its limited perceptual scope, suboptimal integration of multi-level features, and coupled classification–regression operations, this paper presents YOLO-MAD, an enhanced lightweight framework. This approach adopts a modular design, maintaining the architectural advantages of the YOLOv8 backbone while introducing three key improvements:

AKConv (alterable kernel convolution) is integrated into the backbone to replace standard convolution modules. By introducing dynamic sampling locations and a learnable mask mechanism, AKConv enables the convolutional kernels to adaptively focus on fine-grained and irregular defects (e.g., micro-cracks), substantially boosting the network’s sensitivity to minute structural discontinuities.
BiFPN (bidirectional feature pyramid network) is adopted to replace the original PANet in the neck. By incorporating a learnable weighted feature fusion strategy, BiFPN strengthens the information flow across multi-scale features and facilitates effective interaction among features at different levels. This design improves the model’s expressive power in handling defects of varying scales.
Detect_DyHead (Detect_DynamicHead) is employed in the detection head, which integrates novel task-specific, spatial–contextual, and scale-sensitive attention layers on top of the existing branch separation. This integration further enhances the response to critical regions and improves robustness and accuracy in complex textured backgrounds beyond the baseline separation.

This paper is structured as follows: Section 2 reviews related work on the YOLO architecture and industrial defect detection. Section 3 presents the overall framework of YOLO-MAD and elaborates on the design principles and implementation details of each module. Section 4 provides performance comparisons on the NEU-DET and GC10-DET datasets, supplemented by ablation studies and visual analyses. Section 5 critically examines the empirical findings from Section 4 through quantitative and qualitative assessments. Finally, Section 6 synthesizes the key contributions, acknowledges current constraints, and outlines promising avenues for subsequent research.

2. Related Work

2.1. YOLO

In the field of object detection algorithms, the YOLO series [9,10] has become extremely influential due to its end-to-end architecture and fast inference speed. From the multi-scale prediction and residual networks of YOLOv3 [11], through the in-depth model and training optimization of YOLOv5 [12], to the lightweight design and integrated attention mechanisms of YOLOv7 [13] and YOLOv8 [14,15], this series has continuously improved detection accuracy and efficiency.

However, when applied to industrial defect detection, the YOLO series faces challenges in modeling complex deformations. Confronted with scenarios such as complex backgrounds, small objects, and non-rigid targets, researchers have integrated advanced modules like DCN [16], multi-scale attention mechanisms [17], and lightweight convolution units (e.g., GhostConv [18], ShuffleNet [19]) into the YOLO framework to enhance model perception. However, these improvements are mostly simple combinations of single modules, lacking systematic fusion to comprehensively boost industrial defect identification.

A significant recent trend, particularly pronounced in 2024–2025, has been the emergence of highly efficient YOLO variants explicitly optimized for industrial defect detection. Models like FMV-YOLO [20], WSS-YOLO [21], etc., represent the latest state-of-the-art advancements.

These models share the following core design objectives with our work: enhancing performance on challenging industrial defects under tight computational constraints. They commonly employ techniques such as multi-scale feature fusion, deformable convolutions, various attention modules (e.g., channel, spatial, hybrid), and aggressive lightweighting strategies, and have been extensively evaluated on standard industrial defect benchmarks like NEU-DET and GC10-DET. They each offer unique innovations; FMV-YOLO utilizes feature map variation modules, and WSS-YOLO explores weighted spatial sampling.

However, these approaches often optimize either speed or accuracy aggressively, sometimes incurring trade-offs (e.g., quantization loss, limited generalization to diverse defects) or introducing substantial complexity. Furthermore, there remains potential to achieve a more synergistic integration of multi-scale attention and adaptive spatial features specifically tailored for the geometric and textural variability inherent in steel defects.

As an influential iteration, YOLOv8 incorporates key enhancements, as follows: (1) an evolved CSPDarknet backbone for richer feature extraction; (2) a novel anchor-free head (Detect) for simplified structure; (3) advanced attention like the context aggregation module (CA); (4) flexible model scaling for diverse deployments. Its speed–accuracy balance makes it a strong base for complex industrial inspection.

Consequently, building upon the foundations laid by YOLOv8 and recent lightweight advancements, this paper proposes YOLO-MAD, a YOLO-based multi-scale attention and deformable convolution fusion method. It enhances the response to defect salient regions with integrated multi-scale and multi-channel attention mechanisms, and improves spatial adaptability to deformed defects using strategically placed deformable convolution modules. This synergistic fusion of multi-scale attention and deformable convolution (MAD) aims to significantly boost robustness and accuracy in complex industrial inspection scenarios without compromising efficiency.

2.2. Industrial Defect Detection

Industrial surface defect detection serves as a critical component in manufacturing quality assurance, significantly enhancing product reliability and safety. Current methodologies can be broadly classified into two paradigms: conventional machine vision systems and modern deep learning solutions. Traditional machine vision approaches extract defect features via threshold segmentation, edge detection, and similar techniques, but they perform poorly on complex industrial surface defects and are sensitive to lighting and noise. As deep learning technology developed, CNN-based detection methods became mainstream. Examples include the 2016 Faster R-CNN by Ren et al. [22], the 2016 YOLOv1 by Redmon et al. [9], and the 2017 Mask R-CNN by He et al. [23]. However, these methods still fall short when dealing with tiny or irregular defects. Subsequent innovations have introduced hybrid solutions to address these constraints, such as Wang et al.’s GAN-based data augmentation [24] for enhanced generalization, followed by breakthroughs in self-supervised learning [25] and multi-modal feature fusion methods [26].

In particular, for steel surface defect detection, research has actively explored integrating state-of-the-art architectural components into detectors. AKConv (alterable kernel convolution) [27], with its ability to dynamically learn kernel shapes and offsets, has been integrated into various detectors to better model the diverse morphologies of steel defects like scratches, inclusions, and patches, showing promise in handling irregular deformations. The BiFPN (bidirectional feature pyramid network) [28] is widely adopted to enhance multi-scale feature fusion, proving particularly effective in improving the detection of small and low-contrast defects prevalent in steel imagery. Similarly, mechanisms inspired by Dynamic Head [29], focusing on learning scale-aware, spatial-aware, and task-aware attention in detection heads, have been successfully applied to boost the sensitivity and discrimination capability for intricate and ambiguous defect features on steel surfaces. These technologies represent leading-edge solutions addressing core challenges in the domain.

Nevertheless, existing deep-learning-based methods still face challenges in handling tiny/irregular defects and class-imbalance issues. Improving detection models’ real-time performance and robustness for complex industrial environments remains an urgent problem. In light of these challenges, we conducted a thorough review of several notable enhancements to YOLO-based models. For instance, Ling Wang et al. assembled the CBS, MS, and SPPF modules to implement multi-scale block and spatial attention to improve YOLOv5 [30]. Congzhe You et al. enhanced YOLOv8 by introducing channel-wise attention, spatial-wise attention, and full 3D weights for attention [31]. Additionally, Hongkai Zhang et al. implemented a multi-scale feature fusion (MSF) module and an attention mechanism residual block (CRA block) to optimize the traditional YOLOv5 [32]. These approaches have significantly influenced the development of our “Multi-Scale Geometric Structure Feature Extraction and Fusion” concept. Critically, while the aforementioned advanced modules (AKConv, BiFPN, dynamic head) have demonstrated efficacy in isolation or pairwise combinations within previous defect detection frameworks, a comprehensive and synergistic integration targeting all aspects of feature representation (feature extraction, multi-scale fusion, and discriminative prediction head) specifically for steel defects remains largely unexplored. Their combined potential for maximizing detection performance is yet to be fully harnessed.

To overcome these shortcomings, this paper introduces a novel metal surface defect detection framework that integrates three advanced components, as follows: the BiFPN feature fusion strategy from EfficientDet, the superior AKConv dynamic variable convolution kernel, and the Dynamic Head 3D attention mechanism. It achieves precise detection of complex steel surface defects by synergistically enhancing feature extraction adaptability through AKConv, optimizing multi-scale context aggregation via BiFPN, and boosting discrimination power in the detection head using the dynamic head principle, within a unified architecture based on YOLOv8.

3. Proposed Method

This paper introduces YOLO-MAD, an optimized lightweight detection framework derived from YOLOv8n’s architecture. To significantly boost its feature extraction capabilities, fusion expression abilities, and classification and localization accuracy in industrial defect detection tasks, we integrated three specialized modules tailored for fine-grained and multi-scale feature processing, namely, AKConv, BiFPN, and Detect_DyHead. This synergistic combination enables dynamic multi-scale feature representation, significantly advancing steel defect detection performance.

As depicted in Figure 1, YOLO-MAD implements comprehensive enhancements across the backbone, neck, and head components of the original YOLOv8n architecture. The AKConv module employs deformable convolution operations to better characterize minute and geometrically complex defects. The BiFPN module, through its multi-scale bidirectional fusion mechanism, markedly improves the efficiency of information interaction between disparate feature levels. Furthermore, Detect_DyHead incorporates a multi-dimensional attention mechanism, which in turn boosts the detection head’s responsiveness to crucial regions and refines its task decoupling capabilities. These architectural innovations collectively achieve superior detection precision with minimal computational burden.

3.1. Enhancing Feature Extraction with AKConv

While the YOLOv8 architecture generally excels in generic object detection, its performance in the NEU-DET defect detection task revealed a bottleneck, particularly with crazing and rolled-in scale defects, where mAP values were significantly lower than other categories. This issue primarily stems from the characteristics of these defects; crazing appears as elongated, irregular, and often blurred structures, while rolled-in scale defects exhibit high shape variability. Standard convolutional kernels (e.g., 3 × 3 or 5 × 5) with fixed sampling patterns struggle to capture these fine-grained spatial features. To address this, we integrated AKConv modules to replace some convolutional layers within YOLOv8’s backbone, thereby enhancing the model’s spatial perception flexibility and geometric expression capabilities [27].

AKConv redefines conventional convolution through a dynamic adaptation mechanism that employs dual control modules to regulate both kernel geometry and parameter configuration. Figure 2 shows this architecture and its operational workflow.

AKConv receives an input tensor of shape $(C, H, W)$, where $C$ denotes the number of channels, and $H$ and $W$ are the spatial dimensions. It introduces a learnable offset prediction module, implemented as a convolution ($\mathrm{Conv}_{\mathrm{offset}}$) that outputs $2N$ channels representing the horizontal and vertical offsets $(\Delta x_{i}, \Delta y_{i})$ for each of the $N$ sampling positions.

To better understand how AKConv modifies standard convolution, we revisit the original fixed sampling formulation:

$$\mathrm{Conv}(p_{0}) = \sum_{i} w_{i} \cdot x(p_{0} + p_{n}^{i}) \tag{1}$$

Here, $p_{0}$ is the central position, $p_{n}^{i}$ denotes the predefined offset of the $i$-th kernel element, and $w_{i}$ denotes the learned kernel weight.

In AKConv, additional learnable offsets $\Delta p^{i}$ are added:

$$p^{i} = p_{0} + p_{n}^{i} + \Delta p^{i} \tag{2}$$

This can also be expressed in terms of coordinates:

$$(x_{i}', y_{i}') = (x_{i} + \Delta x_{i},\; y_{i} + \Delta y_{i}) \tag{3}$$

Here, $\Delta x_{i}$ and $\Delta y_{i}$ implicitly include both the predefined relative offsets $p_{n}^{i}$ and the learned dynamic offsets $\Delta p^{i}$.

Since $p^{i}$ may lie between pixel locations, bilinear interpolation is applied:

$$x(p^{i}) = \sum_{q \in \mathcal{N}(p^{i})} G(q, p^{i}) \cdot x(q) \tag{4}$$

where $\mathcal{N}(p^{i})$ includes the four surrounding integer locations, and $G(q, p^{i})$ is the interpolation weight. This makes the sampling differentiable, allowing gradients to backpropagate to the offsets $\Delta p^{i}$, which are trained indirectly via the detection loss $L_{\det}$.

The adaptively sampled feature map is reshaped and passed through a convolutional block (including normalization and SiLU activation) to produce the final output. This process allows AKConv to dynamically adapt its sampling to better capture irregular or fine-grained features [33].
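The sampling in Equations (1), (2), and (4) can be illustrated with a minimal single-channel sketch in plain Python. It is not the AKConv implementation: the function names, the plus-shaped five-point kernel, the hand-set offsets, and the zero contribution of out-of-range locations are all assumptions made for illustration, and in the real module the offsets are predicted by the offset convolution rather than supplied by hand.

```python
import math


def bilinear_sample(feature, y, x):
    """Sample a 2D map (list of rows) at a fractional location, as in Eq. (4).

    Assumption: neighbours that fall outside the map contribute zero.
    """
    height, width = len(feature), len(feature[0])
    y0, x0 = math.floor(y), math.floor(x)
    value = 0.0
    for qy in (y0, y0 + 1):
        for qx in (x0, x0 + 1):
            if 0 <= qy < height and 0 <= qx < width:
                # Interpolation weight G(q, p): product of 1D triangle kernels.
                weight = max(0.0, 1 - abs(qy - y)) * max(0.0, 1 - abs(qx - x))
                value += weight * feature[qy][qx]
    return value


def offset_conv_at(feature, center, base_offsets, learned_offsets, weights):
    """Evaluate Eq. (1) at one centre p0, with positions shifted as in Eq. (2)."""
    cy, cx = center
    total = 0.0
    for (by, bx), (dy, dx), w in zip(base_offsets, learned_offsets, weights):
        # p^i = p0 + p_n^i + delta p^i
        total += w * bilinear_sample(feature, cy + by + dy, cx + bx + dx)
    return total


feature = [[0.0, 1.0, 2.0],
           [3.0, 4.0, 5.0],
           [6.0, 7.0, 8.0]]
base = [(-1, 0), (0, -1), (0, 0), (0, 1), (1, 0)]  # hypothetical plus shape, N = 5
ones = [1.0] * 5

# Zero offsets reproduce an ordinary fixed-grid convolution response.
fixed_response = offset_conv_at(feature, (1, 1), base, [(0.0, 0.0)] * 5, ones)
# Shifting every sampling point half a pixel to the right changes the response.
shifted_response = offset_conv_at(feature, (1, 1), base, [(0.0, 0.5)] * 5, ones)
```

Because every interpolation weight is a piecewise-linear function of the sampling position, the response varies smoothly with the offsets, which is what allows the detection loss to train them.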

Figure 3 delineates the dynamically adaptive morphological configurations and their parameter polymorphism in AKConv. Figure 4 illustrates a simplified example of AKConv. The left panel shows an image selected from the inclusion dataset, and the right panel displays an image from the rolled-in scale dataset. The yellow outlines represent the predefined initial convolution shape (using example (3) from Figure 3b), while the blue outlines depict the transformed convolution shape, with green indicating overlapping parts (simplified for demonstration purposes). Visually, defect areas present as randomly distributed linear flaws and dark spots. Similarly, AKConv adaptively adjusts both the cardinality and topology of convolutional kernel parameters based on the characteristics of the migration feature map. This allows the convolutional kernel to deviate from traditional regular sampling grids, actively aligning with the spatial structures of asymmetric defects like cracks and spots. Concurrently, its dynamic parameter control capability enables the kernel to flexibly allocate parameters and convolution points based on the importance of different regions, which improves the model’s adaptability to local features and its responsiveness to complex geometric shapes and varying targets. Furthermore, by learning saliency weights for each sampling position, AKConv effectively suppresses spurious responses caused by background textures. These integrated characteristics enhance the model’s discriminative capability and robustness when processing images with intricate textures and varied defect morphologies, thereby further resolving the issue of low mAP in YOLOv8’s original convolutions for crack-like defect detection.

3.2. Improving Multi-Scale Feature Fusion with BiFPN

Steel surface defects often vary significantly in size and morphology. Unlike objects in natural images with strong semantic consistency, steel surfaces can have diverse defect types like minute cracks, large oxidation spots, and irregular indentations. This multi-scale heterogeneity significantly increases the challenge of feature representation for detection models. While the PANet structure in the original YOLOv8 offers some cross-layer feature fusion capabilities and outperforms traditional FPN, its linear weighting scheme cannot fully model the relative importance of features at different resolutions. This fusion approach risks suppressing high-level semantic features through excessive low-level detail representation, while potentially allowing large-scale features to dominate the integrated output, consequently degrading detection performance for small objects. To tackle this, we replace YOLOv8’s original module with BiFPN. The schematic representations of the FPN, PANet, and BiFPN network architectures are shown in Figure 5. By integrating BiFPN, we achieve adaptive feature selection and information enhancement, thereby improving overall detection accuracy [28].

Compared to the PANet module utilized in YOLOv8, BiFPN implements systematic replacements and optimizations across several key dimensions.

First, concerning information flow design, BiFPN establishes a bidirectional feature propagation path, simultaneously supporting both top-down and bottom-up multi-level feature interaction. This design enables more comprehensive fusion of hierarchical semantic and spatial information, thereby improving feature representation completeness.

Second, regarding cross-scale connections, BiFPN introduces several optimizations to enhance efficiency and fusion quality:

Removal of single-input nodes: BiFPN removes single-input nodes (those with only one input edge), since these non-fusion nodes contribute negligibly to the network’s feature integration objective. This optimization yields a more efficient bidirectional topology without compromising fusion performance.
Same-level skip connections: Extra connections are established between corresponding input and output nodes residing at the same level. This facilitates the merging of a greater number of features without a substantial rise in computational expense.
Repeated multi-layer fusion: Unlike PANet’s restricted single top-down and bottom-up pathway structure, BiFPN considers each bidirectional (top-down and bottom-up) pathway as an individual feature network layer. This identical layer is subsequently iterated multiple times to facilitate the integration of more advanced features.

To further improve feature fusion quality, BiFPN incorporates a lightweight feature selection mechanism. This mechanism dynamically adjusts the importance of each input feature through learnable weights, suppressing redundant feature interference and strengthening critical information channels, making it particularly sensitive to the representation of minute defect regions. Its fast normalized fusion formula is expressed as follows:

$$O = \sum_{i} \frac{w_{i}}{\epsilon + \sum_{j} w_{j}} \cdot I_{i} \tag{5}$$

Here, $I_{i}$ signifies the $i$-th input feature, $w_{i}$ denotes its corresponding trainable fusion weight, and $O$ corresponds to the aggregated output. This adaptive mechanism enables dynamic weighting of input features during fusion based on their relative importance, thereby enhancing both the flexibility and robustness of the feature integration process.
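As a minimal sketch of Equation (5), the function below fuses same-shaped feature vectors held as Python lists. The function name is hypothetical, and two details are assumptions taken from the EfficientDet formulation rather than from the text above: each weight is clipped with a ReLU so that it stays non-negative, and $\epsilon$ defaults to 0.0001.

```python
def fast_normalized_fusion(inputs, raw_weights, eps=1e-4):
    """Fuse same-shaped feature vectors with the weighting of Eq. (5)."""
    # Assumption: ReLU keeps every fusion weight non-negative.
    weights = [max(0.0, w) for w in raw_weights]
    norm = eps + sum(weights)
    fused = [0.0] * len(inputs[0])
    for feature, w in zip(inputs, weights):
        for idx, value in enumerate(feature):
            fused[idx] += (w / norm) * value
    return fused


# Two inputs with weights 1 and 3: the second input contributes three quarters.
fused = fast_normalized_fusion([[1.0, 2.0], [3.0, 4.0]], [1.0, 3.0], eps=0.0)
```

Unlike a softmax over the weights, this normalization needs only one division per input, which is why it is described as fast.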

Combining bidirectional cross-scale pathways with efficient normalized fusion (exemplified by BiFPN’s level 6 in Figure 5), the fusion process is formalized as follows:

$$P_{6}^{td} = \mathrm{Conv}\left(\frac{w_{1} \cdot P_{6}^{in} + w_{2} \cdot \mathrm{Resize}(P_{7}^{in})}{w_{1} + w_{2} + \epsilon}\right) \tag{6}$$

$$P_{6}^{out} = \mathrm{Conv}\left(\frac{w_{1}' \cdot P_{6}^{in} + w_{2}' \cdot P_{6}^{td} + w_{3}' \cdot \mathrm{Resize}(P_{5}^{out})}{w_{1}' + w_{2}' + w_{3}' + \epsilon}\right) \tag{7}$$

In these equations, $\mathrm{Resize}$ denotes the upsampling or downsampling operation used for scale matching, while $\mathrm{Conv}$ represents the convolution-based feature manipulation. $P_{6}^{in}$ is the input feature at level 6, $P_{6}^{td}$ is the intermediate feature at level 6 in the top-down pathway, and $P_{6}^{out}$ is the corresponding output feature at level 6 of the bottom-up pathway. All other levels follow the same scheme.
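The two-step fusion in Equations (6) and (7) can be traced with one-dimensional feature rows. This is only a sketch under stated assumptions: nearest-neighbour upsampling and stride-2 max pooling stand in for $\mathrm{Resize}$, the convolution defaults to the identity, and all function names are hypothetical.

```python
def upsample2(row):
    """Nearest-neighbour upsampling by 2 (assumed stand-in for Resize on P7)."""
    return [v for v in row for _ in range(2)]


def downsample2(row):
    """Stride-2 max pooling (assumed stand-in for Resize on P5)."""
    return [max(row[i], row[i + 1]) for i in range(0, len(row) - 1, 2)]


def weighted_merge(features, weights, eps):
    """Normalized weighted sum that appears inside Conv(...) in Eqs. (6)-(7)."""
    norm = sum(weights) + eps
    return [sum(w * f[i] for f, w in zip(features, weights)) / norm
            for i in range(len(features[0]))]


def bifpn_level6(p5_out, p6_in, p7_in, w_td, w_out, eps=1e-4,
                 conv=lambda row: row):
    """Return (P6_td, P6_out); conv defaults to the identity for illustration."""
    # Eq. (6): top-down step merges the level-6 input with the resized level 7.
    p6_td = conv(weighted_merge([p6_in, upsample2(p7_in)], w_td, eps))
    # Eq. (7): bottom-up step merges input, intermediate, and resized level 5.
    p6_out = conv(weighted_merge([p6_in, p6_td, downsample2(p5_out)], w_out, eps))
    return p6_td, p6_out


p6_td, p6_out = bifpn_level6(
    p5_out=[1.0, 3.0, 0.0, 2.0, 5.0, 4.0, 1.0, 1.0],
    p6_in=[0.0, 0.0, 0.0, 0.0],
    p7_in=[2.0, 4.0],
    w_td=[1.0, 1.0],
    w_out=[1.0, 1.0, 2.0],
    eps=0.0,
)
```

The level-6 input enters both equations, which is the same-level skip connection described earlier.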

Ultimately, by employing the aforementioned optimization techniques, BiFPN significantly enhanced the fusion efficiency and detection accuracy of the YOLO-MAD framework in steel surface defect detection tasks. Specifically, as shown in Figure 1, BiFPN is integrated at four locations within the model architecture, marked with their respective colors, with depths set to 2, 3, 3, and 2, respectively. Additionally, it ensures alignment with the input requirements of the Detect_DyHead detection head.

3.3. Detect_DyHead: Enhancing Detection with 3D Decoupled Attention

In industrial defect detection tasks, the static fusion strategies of traditional detection heads struggle to adapt to the complex characteristics of defect targets. For instance, micron-level scratches require high-resolution features for precise localization, while randomly distributed point defects necessitate robust spatial context modeling. Furthermore, conventional classification–regression shared convolutions often lack task-specific feature decoupling, leading to semantic confusion within the detection head against noisy backgrounds. To mitigate the aforementioned problems, this paper integrates the Detect_DyHead, built upon a three-dimensional decoupling attention structure. As shown in Figure 6, Detect_DyHead achieves efficient collaboration across multi-scale and multi-task aspects in object detection through multi-task decoupling branches and a fine-grained dynamic router. By introducing a self-attention mechanism, Detect_DyHead decomposes the feature tensor into three orthogonal dimensions—level, space, and channel—to construct a cascaded dynamic attention module. This design enhances the model’s sensitivity to fine-grained defect regions while also reinforcing the synergistic optimization between classification and regression tasks, thereby boosting its effectiveness in handling complex industrial defect detection scenarios [29].

As shown in Figure 7, Detect_DyHead’s three perception enhancement modules achieve feature optimization through attention mechanisms in different dimensions. The overall attention is formulated as follows:

$$W(F) = \pi_{C}\left(\pi_{S}\left(\pi_{L}(F) \cdot F\right) \cdot F\right) \cdot F \tag{8}$$

Here, three distinct attention functions, $\pi_{L}$, $\pi_{S}$, and $\pi_{C}$, operate along the $L$, $S$, and $C$ dimensions, respectively. $L$ represents the feature pyramid level, $S = H \times W$ denotes the spatial dimensions ($H$: height, $W$: width), and $C$ represents the channel dimension.

The attention functions for the three dimensions are described below.

$\pi_{L}$ is a scale-aware attention mechanism that employs global average pooling (GAP) to compress the spatial and channel dimensions, generating scale-sensitive feature descriptors. Combined with $1 \times 1$ convolutions and a hard sigmoid function, it dynamically allocates level weights. This enables adaptive feature selection and fusion based on scale importance. The specific formula is as follows:

$$\pi_{L}(F) \cdot F = \sigma\left(f\left(\frac{1}{SC} \sum_{S, C} F\right)\right) \cdot F \tag{9}$$

Here, $f(\cdot)$ is a linear transformation implemented with $1 \times 1$ convolutions, and the hard sigmoid activation $\sigma(x) = \max\left(0, \min\left(1, \frac{x + 1}{2}\right)\right)$ is applied for non-linear processing.
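Equation (9) can be sketched in a few lines of plain Python. The sketch assumes each pyramid level is stored as a flat list of its $S \times C$ values and replaces the $1 \times 1$ convolution $f$ with a scalar affine map, `scale * x + bias`; both the storage layout and the parameter names are illustrative assumptions.

```python
def hard_sigmoid(x):
    """sigma(x) = max(0, min(1, (x + 1) / 2))."""
    return max(0.0, min(1.0, (x + 1.0) / 2.0))


def scale_aware_attention(levels, scale=1.0, bias=0.0):
    """Reweight every pyramid level by one gate, following Eq. (9).

    levels: list of L lists, each holding the flattened values of one level.
    scale, bias: hypothetical scalar stand-in for the 1x1 convolution f.
    """
    reweighted = []
    for values in levels:
        descriptor = sum(values) / len(values)          # global average pooling
        gate = hard_sigmoid(scale * descriptor + bias)  # one weight per level
        reweighted.append([gate * v for v in values])
    return reweighted


levels = [[1.0, 3.0], [-2.0, -4.0], [0.5, -0.5]]
reweighted = scale_aware_attention(levels)
```

With these inputs the level means are 2, -3, and 0, so the gates are 1, 0, and 0.5: the first level passes unchanged, the second is suppressed, and the third is halved.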

$\pi_{S}$ is a spatially-aware attention mechanism that focuses on spatial-location-related information. It uses deformable convolutions to make the attention sparse. By adjusting the sparse sampling positions via learnable offsets ($\Delta p_{k}$), the convolution kernel adaptively fits defect geometry (e.g., crack propagation direction) and aggregates elements from different levels at corresponding spatial positions.

$$\pi_{S}(F) \cdot F = \frac{1}{L} \sum_{l = 1}^{L} \sum_{k = 1}^{K} w_{l, k} \cdot F(l; p_{k} + \Delta p_{k}; c) \cdot \Delta m_{k} \tag{10}$$

Here, $K$ is the number of sparse sampling locations, and $p_{k} + \Delta p_{k}$ is the sampling location shifted by the learned spatial offset $\Delta p_{k}$, which is derived through deformable convolution operations. $\Delta m_{k}$ indicates the self-learned importance of that position. Both $\Delta p_{k}$ and $\Delta m_{k}$ are adaptively generated from the intermediate feature representations of the input tensor $F$.

$\pi_{C}$ is a task-aware attention mechanism that dynamically activates task-sensitive channels using learnable parameters ($\alpha^{1}$, $\beta^{1}$ and $\alpha^{2}$, $\beta^{2}$). For instance, classification tasks focus on texture differences while regression tasks enhance edge gradients. The specific formula is as follows:

$$\pi_{C}(F) \cdot F = \max\left(\alpha^{1}(F) \cdot F_{c} + \beta^{1}(F),\; \alpha^{2}(F) \cdot F_{c} + \beta^{2}(F)\right) \tag{11}$$

Here, $F_{c}$ is the feature slice of the $c$-th channel, and $\alpha^{1}$, $\alpha^{2}$, $\beta^{1}$, and $\beta^{2}$ are produced by a hyper function that learns to control the activation thresholds inside $\max(\cdot)$.
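The channel-wise maximum in Equation (11) can be sketched as follows. In the real head the four coefficients are predicted from $F$; this sketch passes them in as per-channel lists, and that interface and the function name are assumptions made for illustration.

```python
def task_aware_attention(channels, alpha1, beta1, alpha2, beta2):
    """Apply Eq. (11) channel by channel.

    channels: list of C lists, each the flattened slice F_c of one channel.
    alpha1, beta1, alpha2, beta2: per-channel coefficients (assumed given here).
    """
    activated = []
    for c, slice_c in enumerate(channels):
        activated.append([
            max(alpha1[c] * v + beta1[c], alpha2[c] * v + beta2[c])
            for v in slice_c
        ])
    return activated


channels = [[-1.0, 2.0], [-1.0, 2.0]]
# Channel 0 uses (1, 0, 0, 0), which behaves like ReLU; channel 1 uses
# (1, 0, 0.1, 0), which behaves like a leaky ReLU with slope 0.1.
activated = task_aware_attention(
    channels,
    alpha1=[1.0, 1.0], beta1=[0.0, 0.0],
    alpha2=[0.0, 0.1], beta2=[0.0, 0.0],
)
```

The example shows why the mechanism can switch channels on and off: different coefficient sets give each channel its own activation shape.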

In this paper, the original detection head was replaced with Detect_DyHead, configured at the positions indicated by the orange blocks in Figure 1. It takes the output features of the C2f blocks as input and applies the multi-attention mechanisms described above to enhance the model’s detection accuracy.

4. Experiment

4.1. Dataset and Experimental Configuration

Our investigation primarily leveraged two datasets: NEU-DET and GC10-DET. The NEU-DET dataset serves as an authoritative benchmark for steel surface defect detection, encompassing 1800 grayscale images, each originally 200 × 200 pixels in size. It covers six typical categories of steel surface imperfections, as depicted in Figure 8 [34,35,36], and features meticulous bounding box annotations that provide precise information on both the type and location of the defects present in each image. To maintain independence and ensure reliable model evaluation, we randomly split the NEU-DET dataset into training and testing sets using a 4:1 ratio, as illustrated in Figure 8.

Complementing the NEU-DET, the GC10-DET dataset was employed for cross-dataset evaluation [37]. This dataset features high-resolution images (2048 × 1000 pixels) and covers 10 representative steel surface defect categories, thus offering a broader spectrum of defect types. Employing GC10-DET allowed us to further substantiate our model’s reliability and robustness.

All experiments utilized images resized to 200 × 200 pixels as the standardized input. The model optimization employed the Adam optimizer, starting with a learning rate of 0.01 and a weight decay parameter set at 0.0005. Training spanned 200 epochs, using batches of 16 samples. Computational operations were executed on a system outfitted with an NVIDIA GeForce RTX 3060 GPU, configured with CUDA 11.8 and PyTorch 2.0.0. Additionally, Mosaic augmentation was employed during training, which merges four images into one to enhance dataset diversity and improve model generalization.

Similar to the NEU-DET dataset, as detailed in Table 1, the GC10-DET dataset was also randomly divided into training and testing sets at a 4:1 ratio (80% for training, 20% for testing), ensuring the consistency of sample proportions across each defect class during the division process.
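A class-balanced 4:1 split of this kind can be sketched with the standard library alone. The sketch assumes that each image carries a single class label, which may not hold for images containing several defect types, and the function name, seed, and image identifiers are hypothetical.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_fraction=0.2, seed=0):
    """Split image ids into train/test while keeping per-class proportions.

    labels: dict mapping image id -> class name (assumed one label per image).
    """
    by_class = defaultdict(list)
    for image_id, label in labels.items():
        by_class[label].append(image_id)

    rng = random.Random(seed)  # fixed seed so the split is reproducible
    train_ids, test_ids = [], []
    for label in sorted(by_class):
        ids = sorted(by_class[label])
        rng.shuffle(ids)
        n_test = round(len(ids) * test_fraction)
        test_ids.extend(ids[:n_test])
        train_ids.extend(ids[n_test:])
    return train_ids, test_ids


# Illustrative data: 1800 images spread evenly over six placeholder classes.
labels = {f"img_{k}_{i}": f"class_{k}" for k in range(6) for i in range(300)}
train_ids, test_ids = stratified_split(labels)
```

Under these assumptions the split yields 1440 training and 360 testing images, with 60 test images drawn from every class.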

4.2. Performance Comparison with SOTA Approaches

To thoroughly assess the performance enhancements and lightweighting achieved by the YOLO model following the module modifications in this study, we used two key indicators: mean average precision (mAP) and giga floating-point operations (GFLOPs). These quantitative measures provide a multidimensional assessment of the model’s effectiveness.

Average precision (AP) is computed as the integral of the precision–recall curve, and its calculation formula is as follows:

$$\mathrm{AP} = \int_{0}^{1} p(r) \, dr \tag{12}$$

Here, the precision–recall function $p(r)$ denotes the precision at recall level $r$. The integration operation effectively combines precision measurements across the entire spectrum of recall, offering a robust and balanced assessment of the model’s ability to correctly identify instances of a specific class while minimizing false detections. To obtain a comprehensive evaluation across all object categories, the mean average precision (mAP) is obtained by averaging the individual AP scores over all $N_{c}$ object categories, calculated as follows:

$$\mathrm{mAP} = \frac{1}{N_{c}} \sum_{i=1}^{N_{c}} \mathrm{AP}_{i} \tag{13}$$
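The two formulas above can be approximated numerically from a sampled precision–recall curve. The sketch below uses all-point interpolation, one common convention that is not necessarily the one implemented by the evaluation code used in this study; the function names and the sample curve are illustrative assumptions.

```python
def average_precision(recalls, precisions):
    """Approximate AP, the integral of p(r) over recall in [0, 1].

    recalls must be sorted in increasing order. The curve is padded at
    r = 0 and r = 1, precision is made monotonically non-increasing,
    and the area is accumulated as rectangles.
    """
    r = [0.0] + list(recalls) + [1.0]
    p = [0.0] + list(precisions) + [0.0]
    # Replace each precision by the maximum precision at any higher recall.
    for i in range(len(p) - 2, -1, -1):
        p[i] = max(p[i], p[i + 1])
    # Sum rectangle areas over every recall step.
    return sum((r[i + 1] - r[i]) * p[i + 1] for i in range(len(r) - 1))

def mean_average_precision(ap_per_class):
    """Average the per-class AP values, as in the mAP definition."""
    return sum(ap_per_class.values()) / len(ap_per_class)

# Hypothetical precision-recall samples for one class (not from the paper).
ap = average_precision([0.2, 0.4, 0.6, 0.8], [1.0, 0.8, 0.6, 0.5])
```

Because recall never reaches 1 in this sample, the last segment contributes zero area, which is how undetected instances lower the AP.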

GFLOPs (giga floating-point operations) quantify a model’s computational complexity. In practical industrial applications, particularly on edge devices or in scenarios demanding high real-time performance, a model’s computational efficiency is paramount. Comparing the GFLOPs of the model before and after the modifications shows how the changes affect its computational cost. This study aims to effectively decrease the model’s computational burden while simultaneously enhancing detection performance through module modifications. Therefore, GFLOPs is a key indicator for verifying whether the lightweighting objective has been achieved.
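As a rough illustration of where such numbers come from, the cost of a single convolutional layer can be counted by hand. The sketch assumes that one multiply-add counts as two operations and ignores bias terms; profiling tools differ on both points, and the layer dimensions used here are hypothetical rather than taken from YOLO-MAD.

```python
def conv2d_flops(c_in, c_out, kernel, h_out, w_out, groups=1):
    """FLOPs of one 2D convolution, counting a multiply-add as 2 operations."""
    # Each output element needs (c_in / groups) * kernel * kernel multiply-adds.
    macs = (c_in // groups) * c_out * kernel * kernel * h_out * w_out
    return 2 * macs

def to_gflops(flops):
    """Convert a raw operation count to giga floating-point operations."""
    return flops / 1e9

# Hypothetical layer: 3 x 3 convolution, 64 -> 128 channels, 80 x 80 output.
layer_flops = conv2d_flops(64, 128, 3, 80, 80)
layer_gflops = to_gflops(layer_flops)
```

Summing this quantity over all layers for one forward pass at the chosen input resolution gives the model-level GFLOPs figure.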

Table 2 shows that WSS-YOLO [21] achieves the highest overall mAP of 82.3% on the NEU-DET dataset. Nevertheless, our proposed YOLO-MAD demonstrates highly competitive performance, attaining a mAP of 76.6%. This result outperforms all baseline YOLO models by a considerable margin, highlighting the effectiveness of its architectural design. Our model also secures the best performance in half of the defect categories, that is, IN with 85.0%, PA with 94.2%, and SC with 93.0%. These strong results are largely attributed to the AKConv module, which excels at capturing the distinct geometric structural features of these defect types, often characterized by their clear and relatively larger shapes. Regarding computational efficiency, YOLO-MAD remains frugal (9.4 GFLOPs, 4.27 M parameters), placing it firmly in the lightweight model category suitable for real-time applications. Compared to the official baselines, YOLO-MAD delivers this substantial performance gain with only a modest increase in computation over YOLOv8n (8.1 GFLOPs) and remains vastly more efficient than larger variants like YOLOv8l (164.8 GFLOPs).

However, YOLO-MAD shows lower performance than WSS-YOLO on defects like crazing (CR). This limitation primarily stems from CR defects often exhibiting features (color, texture) highly similar to the background. In such cases with low feature contrast, the AKConv module faces challenges in precisely extracting and delineating the subtle spatial structures characteristic of CR from the noisy background.

As presented in Table 3, YOLO-MAD continues to demonstrate strong generalization, achieving a robust mAP of 68.4% on the more diverse GC10-DET dataset. This performance surpasses all compared baseline YOLO models (YOLOv8n/l, YOLOv11n/l) as well as traditional detectors like Libra Faster R-CNN and RetinaNet, validating its capability across varied defect scenarios. WSS-YOLO [21] again achieves the highest overall mAP on this dataset at 72.0%, but YOLO-MAD trails it by a relatively small gap of 3.6 percentage points. YOLO-MAD exhibits high-level performance across multiple defect types, securing the best scores for Cg at 96.0% and Wf at 77.1%. These results underscore YOLO-MAD’s robustness and effectiveness in handling diverse and complex defect patterns prevalent in industrial steel surfaces.

Despite YOLO-MAD’s strong overall performance, accuracy decreases slightly for several specific defect types (such as Os, Ss, and Cr), which indicates that there is still room for improvement. From a data perspective, the decline is attributed to the unique characteristics of these defects or their relatively low representation in the dataset. From a model standpoint, we suggest that the issue is related to the “non-optimal control” of variability in the experiments. Specifically, the design of AKConv allows considerable freedom in kernel shape, size, and stride configuration, and a change in any of these parameters can affect the recognition performance of all categories. While the initial kernel we generated did improve the overall mAP and the detection of specific types such as Cg and Wf, a kernel shape targeted at these features may not ensure adequate feature coverage during the x and y coordinate transformations for creases that extend irregularly in the horizontal, vertical, or lateral directions. This, in turn, leads to a decline in the detection capability for the Cr category.

As shown in Figure 9, for several defect types, including IN, PA, and RS, the YOLO-MAD model consistently exhibited higher confidence scores in its predictions compared to the baseline. This improvement underscores the effectiveness of our multi-scale feature learning and fusion strategies, primarily facilitated by components like BiFPN and the adaptive capabilities of Detect_DyHead. By integrating and processing information across various scales more effectively, the model has acquired a richer understanding of diverse defect characteristics, leading to more robust and confident detections. Furthermore, a notable improvement was observed for SC defects. In instances where the YOLOv8n baseline failed to detect certain scratches, YOLO-MAD successfully identified them. This enhanced capability can be attributed to the AKConv module, which is designed to learn and analyze multiple geometric structural features simultaneously. By adapting its convolutional kernels to capture the unique linear characteristics of scratches alongside other defect morphologies, the model achieves a more comprehensive perception of the defect’s geometry, enabling the detection of subtle or previously missed imperfections.

Conversely, for CR defects, the YOLO-MAD model occasionally displayed lower prediction confidence. This challenge likely stems from the inherent similarity between the visual characteristics of crazing defects and the background, both in terms of color and texture distribution, on the steel surface. When defect features closely resemble the surrounding environment, even advanced models may struggle to precisely delineate the defect boundaries and confidently classify them. This suggests a potential area for future research focusing on enhancing feature discriminability in visually ambiguous scenarios.

To further measure the detection proficiency of our proposed YOLO-MAD model, we analyzed its PR curves in comparison to the YOLOv8n, as presented in Figure 10 and Figure 11.

Overall, the PR curve for “all cases” of the YOLO-MAD model demonstrates superior performance compared to the baseline, indicating the enhanced general capability of our model. This improvement is consistently reflected across the PR curves for the majority of individual defect types, where YOLO-MAD shows a clear upward trend, signifying better precision at various recall levels. This suggests that the integrated modules effectively contribute to more accurate and comprehensive defect detection across most categories.

However, a notable exception is observed for CR defects, where the model’s performance, as depicted by its PR curve, shows a declining trend. This degradation is likely attributable to the challenges previously discussed, specifically the inherent similarity between the visual characteristics of crazing defects and the background’s color and texture distribution on the steel surface. Such visual ambiguity can hinder the precise extraction and fusion of geometric features, reducing the model’s ability to distinguish these defects from their surroundings. This highlights a specific area where further refinement in feature learning for highly ambiguous defect types is warranted.

To obtain a finer insight into our model’s classification performance at the class level, we analyzed the confusion matrices for both the YOLOv8n baseline and our proposed YOLO-MAD model, as presented in Figure 12 and Figure 13, respectively.

A visual inspection reveals that the majority of correct defect identifications in the YOLO-MAD model exhibit deeper color intensities along the diagonal, indicating a general enhancement in the recognition accuracy for most defect types. This indicates that the modifications introduced in YOLO-MAD have broadly improved the model’s ability to correctly classify various steel surface defects.

However, certain challenges persist. Notably, the accuracy for CR defects appears to have decreased. Furthermore, a small number of IN samples, which were previously correctly classified by the baseline, are now misclassified into categories that were not erroneous before. A more detailed examination of the CR row in the YOLO-MAD confusion matrix highlights a significant issue: a substantial portion of CR defect instances are mistakenly identified as background. In fact, the proportion of CR defects misclassified as background often exceeds the proportion of CR defects correctly identified. This phenomenon strongly indicates that the structural similarity between crazing defects and the background’s texture and color distribution poses a considerable challenge for precise feature extraction and subsequent accurate classification.
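The comparison made above, the share of CR instances missed as background against the share correctly identified, amounts to normalizing one ground-truth class of the confusion matrix. A minimal sketch follows; the counts are hypothetical placeholders and are not read from Figure 12 or Figure 13.

```python
def outcome_rates(counts):
    """Normalize prediction counts for one ground-truth class to proportions."""
    total = sum(counts.values())
    return {predicted: n / total for predicted, n in counts.items()}

# Hypothetical outcomes for ground-truth CR instances (not taken from the paper).
cr_counts = {"CR": 40, "background": 55, "RS": 5}
cr_rates = outcome_rates(cr_counts)

# True when more CR instances are missed as background than correctly detected.
missed_more_than_found = cr_rates["background"] > cr_rates["CR"]
```

A background share larger than the diagonal share means the dominant failure for this class is a missed detection, not confusion with another defect type.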

For the majority of defect categories, the Grad-CAM visualizations in Figure 14 reveal that the YOLO-MAD model places significantly more emphasis on the precise details of the defects [43]. The concentrated red areas in the heatmaps often align well with the morphological traits of the defects. For instance, in SC and IN defects, the red regions clearly trace their respective linear or irregular geometric structures, indicating that the model effectively learns and leverages these distinct features for accurate identification. This enhanced focus on relevant defect features contributes to the overall improved performance observed in our model [44,45].

However, for certain defects like CR, the Grad-CAM heatmaps show a less concentrated and more dispersed red activation. This suggests that when the structural features of the defect are highly similar to the surrounding background environment, the model struggles to precisely pinpoint the discriminative regions. This difficulty in correctly selecting and focusing on the most relevant areas directly contributes to the decreased confidence and accuracy observed for CR defects in our earlier analyses.
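For reference, the core Grad-CAM computation [43] behind such heatmaps can be sketched in a few lines: each channel of a chosen feature layer is weighted by the spatial average of its gradient, and the weighted sum is passed through a ReLU. The sketch operates on plain nested lists; which layer and which detection score feed it in a detector such as YOLO-MAD is an assumption left open here, and the toy inputs are hypothetical.

```python
def grad_cam(activations, gradients):
    """Grad-CAM map from one layer's activations and gradients.

    activations, gradients: lists of K channel maps, each an H x W list of
    floats; gradients hold d(score)/d(activation) for the target score.
    """
    k = len(activations)
    h, w = len(activations[0]), len(activations[0][0])
    # Channel weights: global average of each gradient map.
    alphas = [sum(sum(row) for row in g) / (h * w) for g in gradients]
    # Weighted sum over channels followed by ReLU.
    cam = [[max(0.0, sum(alphas[c] * activations[c][y][x] for c in range(k)))
            for x in range(w)] for y in range(h)]
    # Scale to [0, 1] so the map can be rendered as a heatmap.
    peak = max(max(row) for row in cam)
    if peak > 0:
        cam = [[v / peak for v in row] for row in cam]
    return cam

# Toy example: channel 0 supports the score, channel 1 opposes it.
acts = [[[1.0, 0.0], [0.0, 0.0]], [[0.0, 0.0], [0.0, 1.0]]]
grads = [[[1.0, 1.0], [1.0, 1.0]], [[-1.0, -1.0], [-1.0, -1.0]]]
heatmap = grad_cam(acts, grads)
```

A dispersed map, as reported for CR, corresponds to many locations receiving similar weighted activation, with no clear peak over the defect.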

4.3. The Result of Ablation Study

As shown in Table 4, all introduced module modifications significantly improved the model’s detection capabilities, thereby validating the effectiveness of our enhancement strategies.

The incorporation of the AKConv module brought about a substantial increase in Precision (P), reaching 0.688. AKConv’s strength lies in its adaptive, variable convolutional kernel size and position, which allow it to dynamically adjust its receptive field based on the unique geometric structures of steel defects, such as the linear characteristics of cracks or the punctiform distribution of patches. This precise feature capture capability effectively enhances the model’s ability to recognize various defect morphologies. The introduction of the BiFPN module demonstrated outstanding performance in Recall (R), improving to 0.712. BiFPN leverages an innovative bidirectional weighted feature fusion mechanism to efficiently integrate feature information from multiple scales. This approach enables the model to comprehensively utilize both global contextual information and local fine-grained details, significantly boosting its recall rate for diverse defect types. The Detect_DyHead module contributed most comprehensively to the overall model performance, achieving the best results for both mAP50 and mAP50-95 metrics, reaching 0.745 and 0.410, in that order. Detect_DyHead enhances the detection head’s adaptability by dynamically optimizing across three orthogonal dimensions: scale, space, and channel. This adaptive adjustment capability ensures that the model can achieve more precise and robust predictions when handling complex and varied defects.

The synergistic action of these modules significantly contributes to the overall performance gains of the model. AKConv focuses on refined local feature extraction, BiFPN excels at efficient multi-scale information integration, and Detect_DyHead optimizes the adaptability of the detection head at a macroscopic level. They form a beneficial complementary relationship, collectively enhancing the comprehensive performance of the YOLO-MAD model.

5. Discussion

The proposed YOLO-MAD framework demonstrates significant improvements in detection accuracy and robustness on both the NEU-DET and GC10-DET datasets compared to recent state-of-the-art methods. By integrating three purpose-built modules (AKConv, BiFPN, and Detect_DyHead), YOLO-MAD addresses key challenges in industrial surface defect detection, including irregular geometry, multi-scale heterogeneity, and background interference. The quantitative results shown in Table 2 and Table 3 validate the performance advantages of each component when handling different types of defects, outperforming the baseline YOLOv8n/YOLOv11n and showing competitive or superior results against recent advanced detectors like MSFT-YOLO [40] and YOLO-BA [46].

First, YOLO-MAD achieves superior performance on the IN, PA, and SC categories in the NEU-DET dataset, which often feature clear boundaries and structured shapes. These results can be largely attributed to the AKConv module. By incorporating deformable convolutional kernels and dynamic sampling, AKConv enhances the model’s sensitivity to local spatial irregularities and fine-grained features. Compared with standard convolution used in YOLOv8n and YOLOv11n, AKConv provides more adaptive receptive fields tailored to defect contours, thereby improving detection of irregular and small-scale patterns.
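The sampling idea described above (initial coordinates, plus offsets, read out by interpolation) can be sketched for a single kernel position. This is a simplified illustration and not the AKConv implementation: the offsets are fixed numbers here, whereas in the real module they are predicted by a learned layer, and all function names and values are assumptions.

```python
import math

def bilinear_sample(feature, y, x):
    """Bilinearly interpolate a 2D feature map at a fractional position.

    Positions outside the map contribute zero.
    """
    h, w = len(feature), len(feature[0])
    y0, x0 = math.floor(y), math.floor(x)
    value = 0.0
    for yy in (y0, y0 + 1):
        for xx in (x0, x0 + 1):
            if 0 <= yy < h and 0 <= xx < w:
                weight = (1 - abs(y - yy)) * (1 - abs(x - xx))
                value += weight * feature[yy][xx]
    return value

def irregular_conv_at(feature, center, base_coords, offsets, weights):
    """Response of one irregular kernel at one location.

    base_coords: initial (dy, dx) sampling positions relative to the center.
    offsets: per-point (dy, dx) adjustments (learned in a real model).
    weights: one kernel weight per sampling point.
    """
    cy, cx = center
    total = 0.0
    for (by, bx), (oy, ox), w in zip(base_coords, offsets, weights):
        total += w * bilinear_sample(feature, cy + by + oy, cx + bx + ox)
    return total

# Hypothetical 3 x 3 feature map and a 3-point kernel with fractional offsets.
fmap = [[0.0, 1.0, 2.0], [3.0, 4.0, 5.0], [6.0, 7.0, 8.0]]
response = irregular_conv_at(
    fmap, center=(1, 1),
    base_coords=[(0, 0), (0, 1), (1, 0)],
    offsets=[(0.0, 0.0), (0.0, -0.5), (-0.5, 0.0)],
    weights=[1.0, 1.0, 1.0],
)
```

Because the number of sampling points is simply the length of `base_coords`, the kernel is not restricted to square shapes such as 3 × 3 or 5 × 5.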

AKConv also presents considerable potential for further enhancement. Its detection performance is sensitive to parameters such as kernel shape and sampling strategy, making it relatively unstable under varying conditions. However, if its design is adapted to task-specific and even sample-specific characteristics, its adaptability can be substantially improved. Moreover, the integration of attention mechanisms can increase its generalization capability. For instance, placing a pretrained feature extractor before AKConv to guide the construction of irregular receptive fields may offer a promising optimization strategy.

Second, BiFPN plays a vital role in fusing multi-scale features, especially for defects with significant size variation or unclear boundaries, such as PA and RS. With its bidirectional flow and learnable fusion weights, BiFPN allows semantic features from different depths to be dynamically reweighted and aggregated, thereby enhancing informative cues while suppressing irrelevant ones. This design contributes to YOLO-MAD’s strong generalization across scale-diverse targets, a key advantage over methods lacking sophisticated multi-scale fusion. In future work, BiFPN could be extended with context-aware routing or graph-based attention modules to further improve its effectiveness in complex scenarios.
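The learnable reweighting mentioned above corresponds to the fast normalized fusion proposed with BiFPN in EfficientDet [28]: each input receives a non-negative scalar weight, and the weights are normalized by their sum. The sketch below shows that rule on single-channel maps of equal size; resizing of the inputs and the convolution applied after fusion are omitted, and the example values are hypothetical.

```python
def fast_normalized_fusion(features, raw_weights, eps=1e-4):
    """Fuse same-shaped feature maps with non-negative, normalized weights.

    features: list of H x W maps (lists of rows).
    raw_weights: one learnable scalar per input map.
    """
    # ReLU keeps every weight non-negative.
    w = [max(0.0, rw) for rw in raw_weights]
    # Dividing by the sum bounds each normalized weight between 0 and 1.
    norm = eps + sum(w)
    rows, cols = len(features[0]), len(features[0][0])
    return [[sum(w[i] * features[i][y][x] for i in range(len(features))) / norm
             for x in range(cols)] for y in range(rows)]

# Two hypothetical 1 x 1 feature maps fused with weights 1 and 3.
fused = fast_normalized_fusion([[[2.0]], [[4.0]]], [1.0, 3.0], eps=0.0)
```

With weights 1 and 3, the second input contributes three quarters of the result, which is how the network can emphasize the more informative scale.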

Third, Detect_DyHead greatly improves robustness in noisy and ambiguous environments. For instance, in GC10-DET, YOLO-MAD outperforms baseline methods and shows clear gains over MSFT-YOLO on challenging categories like Cg and Wf, which are known for their low contrast and irregular appearance. Detect_DyHead leverages spatial and scale-aware attention mechanisms combined with decoupled task heads to refine both localization and classification accuracy simultaneously. Its architecture is also naturally extendable to multi-task frameworks, allowing future integration with segmentation or category-aware supervision, potentially surpassing the capabilities of single-task detectors like YOLO-BA.
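Assuming Detect_DyHead follows the original Dynamic Head formulation [29], the three attentions are applied one after another to a feature tensor $F \in \mathbb{R}^{L \times S \times C}$, where $L$ is the number of pyramid levels, $S$ the number of spatial positions, and $C$ the number of channels. The symbols below follow that paper and are not defined in the present section:

```latex
W(F) = \pi_{C}\Big( \pi_{S}\big( \pi_{L}(F) \cdot F \big) \cdot F \Big) \cdot F
```

Here $\pi_{L}$, $\pi_{S}$, and $\pi_{C}$ denote the scale-aware, spatial-aware, and task-aware (channel) attention functions, each acting on a single dimension of $F$.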

Nevertheless, the performance of YOLO-MAD remains limited in CR and Cr categories. These defects often exhibit weak contrast and lack distinct structural patterns (as shown in Figure 15), making them visually similar to the background. As shown in the Grad-CAM visualizations (Figure 14), the attention responses for these categories are scattered or misaligned, indicating that the model struggles to focus on the relevant areas [47]. Future enhancements such as saliency boosting, contextual encoding, or multimodal guidance could improve the model’s ability to distinguish low-saliency defects.

From a deployment perspective, YOLO-MAD requires only 9.4 GFLOPs, which keeps it lightweight enough for real-time industrial applications; its cost is comparable to that of YOLOv8n (8.1 GFLOPs) and comparable to or lower than that of MSFT-YOLO. Although the NEU-DET and GC10-DET datasets are derived from real-world environments, variations in lighting and the complexity of steel surfaces still pose challenges for actual industrial deployment. The experimental results in this paper show that YOLO-MAD retains good detection performance when noise points are present on the steel surface, but further fine-tuning is still needed for strong and weak lighting conditions. Moreover, the heatmaps show that the model can mark a target at the image edge as salient without producing a bounding box for it, indicating that its detection performance for edge targets is relatively weak. Nevertheless, YOLO-MAD is extensible and leaves considerable room for improvement. It can handle various defect shapes and sizes, so it can be adapted to real scenarios and to other tasks such as pipeline corrosion monitoring, weld seam inspection, and composite crack detection.

In summary, YOLO-MAD achieves a favorable balance between accuracy and efficiency. Compared with other YOLO-based variants such as MSFT-YOLO [40] and YOLO-BA [46], it emphasizes architectural synergy and semantic representation, showing strong practical potential and scalability. Future research may explore integrating domain priors, task-adaptive designs, and multimodal inputs to further improve generalization and performance.

6. Conclusions

This paper presents YOLO-MAD, a lightweight and modular object detection framework tailored for steel surface defect detection in industrial scenarios. To address key challenges such as irregular defect shapes, scale variation, and strong background interference, the model incorporates the following three targeted modules: AKConv for enhanced geometric feature extraction, BiFPN for efficient multi-scale fusion, and Detect_DyHead for decoupled task-aware attention. Built upon YOLOv8n, the proposed framework significantly improves detection performance on two benchmark datasets, NEU-DET and GC10-DET, outperforming several state-of-the-art methods across multiple challenging defect categories.

Through both quantitative evaluation and qualitative Grad-CAM visualization, this study reveals that AKConv performs best in capturing fine-grained patterns with clear structural boundaries (e.g., IN and SC), BiFPN demonstrates strong adaptability to scale and structural variance (e.g., PA and RS), and Detect_DyHead improves robustness in complex textures and noisy backgrounds (e.g., Cg and Wf). Notably, YOLO-MAD achieves these improvements with only 9.4 GFLOPs of computational cost, making it suitable for real-time and edge-level deployment in industrial settings.

Future work may explore the integration of domain-specific priors to fine-tune AKConv parameters, as well as the incorporation of saliency-based attention or contextual encoding to enhance the model’s representation of weak-contrast defects. Moreover, the modular architecture of YOLO-MAD provides a promising foundation for extensions to multi-task learning (e.g., detection + segmentation) and cross-modal defect analysis. Overall, YOLO-MAD strikes an effective balance between accuracy, efficiency, and scalability, offering practical value for intelligent industrial inspection systems.

Author Contributions

Conceptualization, H.D. and Y.C.; methodology, H.D.; software, H.D. and J.C.; validation, J.C. and H.Y.; formal analysis, J.C.; investigation, H.D.; resources, H.Y. and Y.C.; data curation, H.D.; writing—original draft preparation, H.D. and J.C.; writing—review and editing, H.D., J.C., H.Y., and Y.C.; visualization, J.C. and Y.C.; supervision, H.Y. and Y.C.; project administration, H.Y. and J.C.; funding acquisition, H.Y. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data presented in this study are available upon request from the corresponding author.

Acknowledgments

This research did not receive any specific grant from funding agencies in the public, commercial, or not-for-profit sectors. We thank the associate editor and the reviewers for their useful feedback that improved this paper.

Conflicts of Interest

The authors declare no conflicts of interest.

References

1. Chen, Y.; Ding, Y.; Zhao, F.; Zhang, E.; Wu, Z.; Shao, L. Surface Defect Detection Methods for Industrial Products: A Review. Appl. Sci. 2021, 11, 7657.
2. Qiao, Q.; Hu, H.; Ahmad, A.; Wang, K. A Review of Metal Surface Defect Detection Technologies in Industrial Applications. IEEE Access 2025, 13, 48380–48400.
3. Lee, S.; Chang, L.M.; Skibniewski, M. Automated recognition of surface defects using digital color image processing. Autom. Constr. 2006, 15, 540–549.
4. Rattanaphan, S.; Briassouli, A. Evaluating Generalization, Bias, and Fairness in Deep Learning for Metal Surface Defect Detection: A Comparative Study. Processes 2024, 12, 456.
5. Zhao, B.; Chen, Y.; Jia, X.; Ma, T. Steel surface defect detection algorithm in complex background scenarios. Measurement 2024, 237, 115189.
6. Zheng, X.; Zheng, S.; Kong, Y.; Chen, J. Recent advances in surface defect inspection of industrial products using deep learning techniques. Int. J. Adv. Manuf. Technol. 2021, 113, 35–58.
7. Jiang, P.; Ergu, D.; Liu, F.; Cai, Y.; Ma, B. A Review of Yolo Algorithm Developments. Procedia Comput. Sci. 2022, 199, 1066–1073.
8. Xu, H.; Zhang, Z.; Ye, H.; Song, J.; Chen, Y. Efficient Steel Surface Defect Detection via a Lightweight YOLO Framework with Task-Specific Knowledge-Guided Optimization. Electronics 2025, 14, 2029.
9. Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You Only Look Once: Unified, Real-Time Object Detection. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 779–788.
10. Ultralytics. Ultralytics YOLO. Available online: https://docs.ultralytics.com/ (accessed on 10 May 2025).
11. Redmon, J.; Farhadi, A. YOLOv3: An Incremental Improvement. arXiv 2018.
12. Jocher, G. YOLOv5 by Ultralytics. 2020. Available online: https://github.com/ultralytics/yolov5 (accessed on 10 May 2025).
13. Wang, C.Y.; Bochkovskiy, A.; Liao, H.Y.M. YOLOv7: Trainable Bag-of-Freebies Sets New State-of-the-Art for Real-Time Object Detectors. In Proceedings of the 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 17–24 June 2023; pp. 7464–7475.
14. Varghese, R.; Sambath, M. YOLOv8: A Novel Object Detection Algorithm with Enhanced Performance and Robustness. In Proceedings of the 2024 International Conference on Advances in Data Engineering and Intelligent Computing Systems (ADICS), Chennai, India, 18–19 April 2024; pp. 1–6.
15. Jocher, G.; Chaurasia, A.; Qiu, J. YOLO by Ultralytics. 2023. Available online: https://github.com/ultralytics/ultralytics (accessed on 10 May 2025).
16. Dai, J.; Qi, H.; Xiong, Y.; Li, Y.; Zhang, G.; Hu, H.; Wei, Y. Deformable Convolutional Networks. In Proceedings of the 2017 IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; pp. 764–773.
17. Ouyang, D.; He, S.; Zhang, G.; Luo, M.; Guo, H.; Zhan, J.; Huang, Z. Efficient Multi-Scale Attention Module with Cross-Spatial Learning. In Proceedings of the ICASSP 2023—2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Rhodes Island, Greece, 4–10 June 2023; pp. 1–5.
18. Han, K.; Wang, Y.; Tian, Q.; Guo, J.; Xu, C.; Xu, C. GhostNet: More Features from Cheap Operations. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 1580–1589.
19. Zhang, X.; Zhou, X.; Lin, M.; Sun, J. ShuffleNet: An Extremely Efficient Convolutional Neural Network for Mobile Devices. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–23 June 2018; pp. 6848–6856.
20. He, L.; Zheng, L.; Xiong, J. FMV-YOLO: A Steel Surface Defect Detection Algorithm for Real-World Scenarios. Electronics 2025, 14, 1143.
21. Lu, M.; Sheng, W.; Zou, Y.; Chen, Y.; Chen, Z. WSS-YOLO: An improved industrial defect detection network for steel surface defects. Measurement 2024, 236, 115060.
22. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. In Proceedings of the 29th International Conference on Neural Information Processing Systems, Montreal, QC, Canada, 7–12 December 2015; Volume 39, pp. 1137–1149.
23. He, K.; Gkioxari, G.; Dollar, P.; Girshick, R. Mask R-CNN. In Proceedings of the 2017 IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; pp. 2961–2969.
24. Zhang, G.; Cui, K.; Hung, T.Y.; Lu, S. Defect-GAN: High-Fidelity Defect Synthesis for Automated Defect Inspection. In Proceedings of the 2021 IEEE Winter Conference on Applications of Computer Vision (WACV), Waikoloa, HI, USA, 3–8 January 2021; pp. 2524–2534.
25. Zabin, M.; Kabir, A.N.B.; Kabir, M.K.; Choi, H.J.; Uddin, J. Contrastive self-supervised representation learning framework for metal surface defect detection. J. Big Data 2023, 10, 145.
26. Lu, H.; Zhu, Y.; Yin, M.; Yin, G.; Xie, L. Multimodal Fusion Convolutional Neural Network with Cross-Attention Mechanism for Internal Defect Detection of Magnetic Tile. IEEE Access 2022, 10, 60876–60886.
27. Zhang, X.; Song, Y.; Song, T.; Yang, D.; Ye, Y.; Zhou, J.; Zhang, L. AKConv: Convolutional Kernel with Arbitrary Sampled Shapes and Arbitrary Number of Parameters. arXiv 2023.
28. Tan, M.; Pang, R.; Le, Q.V. EfficientDet: Scalable and Efficient Object Detection. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 10781–10790.
29. Dai, X.; Chen, Y.; Xiao, B.; Chen, D.; Liu, M.; Yuan, L.; Zhang, L. Dynamic Head: Unifying Object Detection Heads with Attentions. In Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 20–25 June 2021; pp. 7373–7382.
30. Wang, L.; Liu, X.; Ma, J.; Su, W.; Li, H. Real-Time Steel Surface Defect Detection with Improved Multi-Scale YOLO-v5. Processes 2023, 11, 1357.
31. You, C.; Kong, H. Improved Steel Surface Defect Detection Algorithm Based on YOLOv8. IEEE Access 2024, 12, 99570–99577.
32. Zhang, H.; Li, S.; Miao, Q.; Fang, R.; Xue, S.; Hu, Q.; Hu, J.; Chan, S. Surface defect detection of hot rolled steel based on multi-scale feature fusion and attention mechanism residual block. Sci. Rep. 2024, 14, 7671.
33. Yang, Y.; Feng, Z.; Jin, W.; Miao, P. ADD-YOLO: A new model for object detection in aerial images. Multimed. Syst. 2025, 31, 120.
34. Bao, Y.; Song, K.; Liu, J.; Wang, Y.; Yan, Y.; Yu, H.; Li, X. Triplet-graph reasoning network for few-shot metal generic surface defect segmentation. IEEE Trans. Instrum. Meas. 2021, 70, 5011111.
35. Song, K.; Yan, Y. A noise robust method based on completed local binary patterns for hot-rolled steel strip surface defects. Appl. Surf. Sci. 2013, 285, 858–864.
36. He, Y.; Song, K.; Meng, Q.; Yan, Y. An end-to-end steel surface defect detection approach via fusing multiple hierarchical features. IEEE Trans. Instrum. Meas. 2019, 69, 1493–1504.
37. Lv, X.; Duan, F.; Jiang, J.j.; Fu, X.; Gan, L. Deep metallic surface defect detection: The new benchmark and detection network. Sensors 2020, 20, 1562.
38. Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.Y.; Berg, A.C. SSD: Single shot multibox detector. In Proceedings of the Computer Vision—ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, 11–14 October 2016; Proceedings, Part I 14. Springer: Cham, Switzerland, 2016; pp. 21–37.
39. Ma, X.; Deng, X.; Kuang, H.; Liu, X. YOLOv7-BA: A Metal Surface Defect Detection Model Based On Dynamic Sparse Sampling And Adaptive Spatial Feature Fusion. In Proceedings of the 2024 IEEE 6th Advanced Information Management, Communicates, Electronic and Automation Control Conference (IMCEC), Chongqing, China, 24–26 May 2024; Volume 6, pp. 292–296.
40. Guo, Z.; Wang, C.; Yang, G.; Huang, Z.; Li, G. MSFT-YOLO: Improved YOLOv5 Based on Transformer for Detecting Defects of Steel Surface. Sensors 2022, 22, 3467.
41. Pang, J.; Chen, K.; Shi, J.; Feng, H.; Ouyang, W.; Lin, D. Libra r-cnn: Towards balanced learning for object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 821–830.
42. Lin, T.Y.; Goyal, P.; Girshick, R.; He, K.; Dollár, P. Focal loss for dense object detection. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2980–2988.
43. Selvaraju, R.R.; Cogswell, M.; Das, A.; Vedantam, R.; Parikh, D.; Batra, D. Grad-cam: Visual explanations from deep networks via gradient-based localization. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 618–626.
44. Girshick, R.; Donahue, J.; Darrell, T.; Malik, J. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June 2014; pp. 580–587.
45. Lipton, Z.C. The mythos of model interpretability: In machine learning, the concept of interpretability is both important and slippery. Queue 2018, 16, 31–57.
46. Shu, X.; Xu, L.; He, Z.; Sheng, L.; Ye, G.; Lu, X. Wafer Defect Detection Based on YOLO-BA. In Proceedings of the 2024 International Conference on Sensing, Measurement & Data Analytics in the Era of Artificial Intelligence (ICSMD), Huangshan, China, 31 October–3 November 2024; pp. 1–7.
47. Jezek, S.; Jonak, M.; Burget, R.; Dvorak, P.; Skotak, M. Deep learning-based defect detection of metal parts: Evaluating current methods in complex conditions. In Proceedings of the 2021 13th International Congress on Ultra Modern Telecommunications and Control Systems and Workshops (ICUMT), Brno, Czech Republic, 25–27 October 2021; pp. 66–71.

Figure 1. Overall structure of the model.

Figure 2. AKConv structure. The bottom part of the figure demonstrates the coordinate transformation: Original coordinates: Fixed sampling positions of the standard convolution kernel, without any offsets. Offsets: The trained offset values that refine the original coordinates. Modified coordinates: The resultant sampling pattern after offset application.
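
The coordinate transformation described in the Figure 2 caption can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: the helper names `regular_grid` and `apply_offsets`, the 3 × 3 grid, and the offset values are hypothetical. In a trained network the offsets would be predicted from the input, and features would then be resampled at the fractional positions (typically by interpolation, as in deformable convolution).

```python
from typing import List, Tuple

Coord = Tuple[float, float]


def regular_grid(size: int) -> List[Coord]:
    """Original coordinates: fixed sampling positions of a size x size kernel."""
    return [(float(r), float(c)) for r in range(size) for c in range(size)]


def apply_offsets(original: List[Coord], offsets: List[Coord]) -> List[Coord]:
    """Modified coordinates = original coordinates + per-point offsets."""
    if len(original) != len(offsets):
        raise ValueError("one offset per sampling point is required")
    return [(r + dr, c + dc) for (r, c), (dr, dc) in zip(original, offsets)]


original = regular_grid(3)
# Hypothetical offsets: only the centre point is moved in this toy example.
offsets = [(0.0, 0.0)] * 4 + [(0.5, -0.25)] + [(0.0, 0.0)] * 4
modified = apply_offsets(original, offsets)
print(original[4], "->", modified[4])
```

With zero offsets the modified coordinates equal the original grid, which corresponds to a standard convolution kernel.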

Figure 3. Initial sampling shape: (a) illustrates the flexibility of kernel size customization, while (b) provides an example of the kernel shape with a fixed size of $5 \times 5$.

Figure 4. Simplified demonstration of the sampling process of AKConv on crazing and roll-in scale samples.

Figure 5. Comparison diagram of FPN, PANet, and BiFPN structures.

Figure 6. Structure of the Detect_DyHead, integrating a three-dimensional decoupled attention mechanism to enhance object detection performance. The figure illustrates how feature tensors are decomposed into level, space, and channel dimensions, and highlights how multi-task decoupling branches and a fine-grained dynamic router facilitate efficient collaboration across multi-scale and multi-task aspects.

Figure 7. Detailed schematic of the three-dimensional attention mechanism.

Figure 8. Overview of the NEU-DET dataset partitioning, illustrating the separation into training and testing subsets. This collection features images representing six prevalent steel surface imperfections: crazing, inclusion, patches, pitted surface, rolled-in scale, and scratches. Their respective abbreviations are CR, IN, PA, PS, RS, and SC. The dataset was randomly split into training and testing sets with a 4:1 ratio.
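
A random 4:1 split of the kind mentioned in the Figure 8 caption might look like the sketch below. The function name `split_4_to_1`, the seed, and the file names are hypothetical placeholders. The caption does not say whether the split was stratified by defect category, so this sketch assumes a plain random shuffle.

```python
import random
from typing import List, Sequence, Tuple


def split_4_to_1(items: Sequence[str], seed: int = 0) -> Tuple[List[str], List[str]]:
    """Shuffle a copy of `items` and return (train, test) at a 4:1 ratio."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility
    cut = len(shuffled) * 4 // 5
    return shuffled[:cut], shuffled[cut:]


# Hypothetical file names standing in for the 1800 NEU-DET images.
images = [f"img_{i:04d}.jpg" for i in range(1800)]
train, test = split_4_to_1(images)
print(len(train), len(test))  # 1440 360
```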

Figure 9. Comparison of predictive outputs across various defect categories. (a) Original images are shown. (b) Predictions made by YOLOv8n. (c) Predictions generated by YOLO-MAD. Rows are organized from top to bottom as follows: (1) CR, (2) IN, (3) PA, (4) PS, (5) RS, and (6) SC defects.

Figure 10. Precision–recall (PR) curves for the YOLOv8n model, illustrating its detection performance across various thresholds.

Figure 11. PR curves for the proposed YOLO-MAD model, showcasing its enhanced detection capabilities.

Figure 12. The confusion matrix for the YOLOv8n model, visually representing its classification performance. Each row indicates the true class, while each column represents the predicted class, with diagonal values showing correctly classified instances.

Figure 13. The confusion matrix for the proposed YOLO-MAD model, illustrating its enhanced classification accuracy across different defect types.
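
The row/column convention stated in the Figure 12 caption (rows are true classes, columns are predicted classes, the diagonal holds correct classifications) can be made concrete with a small sketch. The labels below are hypothetical toy data, not results from the paper, and the function names are illustrative. Confusion matrices produced by detection toolkits often include an extra background row and column, which is omitted here.

```python
from typing import Dict, List, Sequence

CLASSES = ["CR", "IN", "PA", "PS", "RS", "SC"]


def confusion_matrix(true: Sequence[str], pred: Sequence[str],
                     classes: Sequence[str]) -> List[List[int]]:
    """matrix[i][j] counts samples of true class i predicted as class j."""
    index = {name: i for i, name in enumerate(classes)}
    matrix = [[0] * len(classes) for _ in classes]
    for t, p in zip(true, pred):
        matrix[index[t]][index[p]] += 1
    return matrix


def per_class_recall(matrix: List[List[int]],
                     classes: Sequence[str]) -> Dict[str, float]:
    """Diagonal entry divided by the row sum; 0.0 for classes with no samples."""
    recall = {}
    for i, name in enumerate(classes):
        total = sum(matrix[i])
        recall[name] = matrix[i][i] / total if total else 0.0
    return recall


# Hypothetical toy labels: one CR sample is confused with RS.
true = ["CR", "CR", "IN", "PA", "PA", "SC"]
pred = ["CR", "RS", "IN", "PA", "PA", "SC"]
cm = confusion_matrix(true, pred, CLASSES)
recall = per_class_recall(cm, CLASSES)
print(recall["CR"], recall["PA"])
```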

Figure 14. Comparison of prediction localization using Grad-CAM heatmaps. Grad-CAM (Gradient-weighted Class Activation Mapping) generates visual explanations highlighting the image regions that are most important for the model’s classification decision. In these heatmaps, redder and deeper colored areas indicate higher activation weights, providing a qualitative insight into the model’s focus. (a) Original images; (b) YOLOv8n Grad-CAM heatmaps; (c) YOLO-MAD Grad-CAM heatmaps. From top to bottom, rows display: (1) CR, (2) PS, (3) SC, (4) RS, (5) PA, and (6) IN defects.
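
The weighting scheme behind Grad-CAM can be sketched without any deep learning framework: each channel of a feature map is weighted by the spatial average of the gradient of the target score with respect to that channel, the weighted maps are summed, and a ReLU keeps only positive evidence. The sketch below assumes the activations and gradients have already been extracted; the 2-channel, 2 × 2 values are hypothetical, and the choice of target score for a detector is left open.

```python
from typing import List

Map2D = List[List[float]]


def grad_cam(activations: List[Map2D], gradients: List[Map2D]) -> Map2D:
    """Grad-CAM heatmap: ReLU of the gradient-weighted sum of activation maps."""
    height, width = len(activations[0]), len(activations[0][0])
    # Channel weight = spatial average of that channel's gradients.
    weights = [sum(map(sum, g)) / (height * width) for g in gradients]
    heatmap = [[0.0] * width for _ in range(height)]
    for w, act in zip(weights, activations):
        for y in range(height):
            for x in range(width):
                heatmap[y][x] += w * act[y][x]
    # ReLU keeps only regions that push the target score up.
    heatmap = [[max(v, 0.0) for v in row] for row in heatmap]
    # Normalize to [0, 1] so the map can be rendered as a color overlay.
    peak = max(max(row) for row in heatmap)
    if peak > 0:
        heatmap = [[v / peak for v in row] for row in heatmap]
    return heatmap


# Hypothetical activations and gradients (2 channels, 2 x 2 spatial size).
activations = [[[1.0, 0.0], [0.0, 2.0]], [[0.0, 1.0], [1.0, 0.0]]]
gradients = [[[1.0, 1.0], [1.0, 1.0]], [[-1.0, -1.0], [-1.0, -1.0]]]
cam = grad_cam(activations, gradients)
print(cam)
```

In this toy case the first channel receives weight 1 and the second weight -1, so only the locations where the first channel dominates remain after the ReLU.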

Figure 15. Illustration of the CR type detection effect. The original image on the top shows that CR defects are often slender and extend in an irregular, winding manner, which poses challenges for convolutional feature extraction. The heatmap’s attention areas and prediction boxes fail to achieve precise coverage.

Table 1. Dataset summary and comparison. This table details key characteristics of the NEU-DET and GC10-DET datasets, including image count, number of defect categories, image resolution, data splitting, and annotation type, providing a comprehensive overview of the datasets used in this study.

Dataset	Image Count	Defect Categories	Resolution	Data Split (Train:Test)	Annotation Type
NEU-DET [35]	1800	6	$200 \times 200$	4:1 (80%:20%)	Bounding Box
GC10-DET [37]	2300	10	$2048 \times 1000$	4:1 (80%:20%)	Bounding Box

Table 2. Performance assessment of various models on the NEU-DET defect dataset. The mAP column and the per-category columns (CR, IN, PA, PS, RS, SC) are given in %, and a dash marks a GFLOPs value that is not available. The models YOLOv8n, YOLOv8l, YOLOv11n, and YOLOv11l are the official versions without additional modules. (Bold numbers indicate the best-performing method for each defect category.)

Method	mAP	CR	IN	PA	PS	RS	SC	GFLOPs
SSD [38]	63.8	47.3	68.5	88.6	68.4	54.7	55.0	281.9
YOLO-BA [39]	74.8	36.3	67.8	91.0	96.6	70.6	86.4	-
MSFT-YOLO [40]	75.2	56.9	80.8	93.5	82.1	52.7	83.5	-
WSS-YOLO [21]	82.3	58.1	80.9	93.9	94.2	73.1	93.9	7.7
YOLOv8n	71.2	43.3	78.8	92.4	83.6	48.8	80.1	8.1
YOLOv11n	71.1	49.3	79.5	92.2	80.2	60.6	65.0	6.3
YOLOv8l	72.5	52.0	75.1	90.7	84.4	55.0	77.6	164.8
YOLOv11l	71.6	41.2	80.4	92.4	81.0	56.4	78.2	86.6
YOLO-MAD	76.6	38.5	85.0	94.2	83.3	65.5	93.0	9.4
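
As a quick consistency check on Table 2, the mAP column can be recomputed as the unweighted mean of the six per-category values. That definition is an assumption here, but it reproduces the reported figures for these two rows to one decimal place. The sketch also shows that the improvement over YOLOv8n is 5.4 percentage points.

```python
from statistics import mean

# Per-category AP values (%) copied from Table 2: CR, IN, PA, PS, RS, SC.
per_class_ap = {
    "YOLOv8n": [43.3, 78.8, 92.4, 83.6, 48.8, 80.1],
    "YOLO-MAD": [38.5, 85.0, 94.2, 83.3, 65.5, 93.0],
}
reported_map = {"YOLOv8n": 71.2, "YOLO-MAD": 76.6}

for name, aps in per_class_ap.items():
    computed = mean(aps)
    print(f"{name}: computed {computed:.1f}, reported {reported_map[name]}")

gain = reported_map["YOLO-MAD"] - reported_map["YOLOv8n"]
print(f"mAP gain over YOLOv8n: {gain:.1f} percentage points")
```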

Table 3. Performance assessment of various models on the GC10-DET defect dataset. The defect types are punching, weld line, crescent gap, water spot, oil spot, silk spot, inclusion, rolled pit, crease, and waist folding, abbreviated as Pu, Wl, Cg, Ws, Os, Ss, In, Rp, Cr, and Wf, respectively. (Bold numbers indicate the best-performing method for each defect category.)

Method	mAP	Pu	Wl	Cg	Ws	Os	Ss	In	Rp	Cr	Wf
Libra Faster R-CNN [41]	58.8	99.5	42.9	94.9	72.8	72.1	62.8	18.8	37.4	17.6	69.3
RetinaNet [42]	65.5	79.6	91.5	94.3	79.1	62.0	66.4	29.7	33.9	35.2	77.0
WSS-YOLO [21]	72.0	98.2	95.2	93.5	87.3	57.9	62.8	38.0	35.4	58.1	93.4
YOLOv8n	63.6	98.2	92.2	90.4	80.5	69.9	61.4	31.3	8.4	32.6	71.0
YOLOv8l	66.1	96.9	94.0	91.8	86.5	69.5	58.2	38.8	17.4	38.0	70.2
YOLOv11n	63.1	95.8	89.5	90.1	81.6	68.8	57.9	39.4	5.6	28.7	73.2
YOLOv11l	66.9	96.4	95.1	93.3	83.9	71.5	62.5	39.3	22.1	34.7	70.7
YOLO-MAD	68.4	98.6	93.4	96.0	80.7	68.7	56.5	38.6	25.2	33.5	77.1

Table 4. Impact of Individual Module Integration on YOLOv8n Performance. This ablation study quantifies the contribution of each proposed module (AKConv, BiFPN, Detect_DyHead) to the overall detection accuracy of the YOLOv8n baseline model. Bold values indicate the best performance for each metric within the respective ablation row. Note that each “+” here refers to the individual addition of the improvement on the YOLOv8n model, not a cumulative addition. Results are presented as mean ± standard deviation over three independent runs, conducted to ensure the statistical robustness of the findings.

Model	P	R	mAP50	mAP50-95
YOLOv8n	0.646 ± 0.002	0.678 ± 0.003	0.712 ± 0.002	0.373 ± 0.001
+AKConv	0.688 ± 0.003	0.689 ± 0.003	0.725 ± 0.002	0.385 ± 0.004
+BiFPN	0.665 ± 0.002	0.712 ± 0.002	0.738 ± 0.005	0.398 ± 0.001
+Detect_DyHead	0.672 ± 0.003	0.708 ± 0.003	0.745 ± 0.001	0.410 ± 0.002
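
To make the individual contributions in Table 4 easier to compare, the sketch below subtracts the YOLOv8n baseline from each single-module variant. It uses only the mean values from the table (the standard deviations are left out), and the helper name `gains_over_baseline` is illustrative.

```python
from typing import Dict

Metrics = Dict[str, float]

# Mean values copied from Table 4; the ± standard deviations are omitted.
ablation: Dict[str, Metrics] = {
    "YOLOv8n": {"P": 0.646, "R": 0.678, "mAP50": 0.712, "mAP50-95": 0.373},
    "+AKConv": {"P": 0.688, "R": 0.689, "mAP50": 0.725, "mAP50-95": 0.385},
    "+BiFPN": {"P": 0.665, "R": 0.712, "mAP50": 0.738, "mAP50-95": 0.398},
    "+Detect_DyHead": {"P": 0.672, "R": 0.708, "mAP50": 0.745, "mAP50-95": 0.410},
}


def gains_over_baseline(table: Dict[str, Metrics],
                        baseline: str = "YOLOv8n") -> Dict[str, Metrics]:
    """Difference of every metric from the baseline row, rounded to 3 decimals."""
    base = table[baseline]
    return {
        model: {metric: round(value - base[metric], 3)
                for metric, value in metrics.items()}
        for model, metrics in table.items()
        if model != baseline
    }


gains = gains_over_baseline(ablation)
for model, delta in gains.items():
    print(model, delta)
```

Read this way, Detect_DyHead gives the largest single-module gain in mAP50 and mAP50-95, AKConv the largest gain in precision, and BiFPN the largest gain in recall.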

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

