3.1. SC-AttentiveNet Network Overview
As shown in
Figure 2, we propose SC-AttentiveNet, a novel network inspired by the CenterNet architecture. This design consists of a backbone for feature extraction, a neck for feature fusion, and a head for final recognition.
The detection head of this model adopts the anchor-free detection idea of CenterNet, which removes the dependence on the prior anchor boxes used by traditional detection methods, simplifies the detection pipeline, and improves the ability to capture small surface defects on copper strips.
Firstly, the backbone of the SC-AttentiveNet model is SCGNNet, a feature extraction network redesigned on the basis of ConvNeXt V2 [25]. It aims to accelerate inference while maintaining high accuracy and effectively reducing the number of parameters and the computational complexity of the model. By introducing this lightweight feature extraction module, we achieve a faster response time in the detection of surface defects on copper strips, which is crucial for real-time monitoring and fast decision-making on production lines, especially in resource-constrained industrial environments.
Secondly, SC-AttentiveNet adopts the SPPF [26] structure and the PSA module from the YOLOv10 [27] model, which are introduced to enhance the feature extraction capability, improve the utilization of spatial information, and strengthen the robustness of the model. In the detection of surface defects on copper strips, the SPPF module extracts defect features at different scales more effectively by spatially pooling the feature map, which makes the model perform more stably in the face of a variety of surface defects (e.g., scratches, pits, etc.). The PSA module, in turn, improves the model’s ability to discriminate defects by focusing on important features, ensuring that small defects on the surface of copper strips can be accurately recognized even against complex backgrounds.
Finally, in the feature fusion part, we designed the HD-CF Fusion Block based on the C2fCIB module, the Dysample up-sampling module, and the HAM attention mechanism, and constructed the feature fusion network from this block. This design significantly enhances the diversity and fine granularity of features in copper strip defect detection, enabling the model to reduce information loss while retaining critical information and thereby improving detection accuracy. Through multi-level feature fusion, the model better captures the relationships between different features, which significantly strengthens defect detection on copper strips in complex scenarios and is crucial for improving production efficiency and ensuring product quality.
SCGNNet, as the feature extraction network, focuses on the lightweight design of the overall model. It significantly enhances computational efficiency while maintaining high detection accuracy for the surface defects on copper strips. The SPPF structure and PSA module play complementary roles: the SPPF structure extracts multi-scale defect features through spatial pooling, improving robustness across various defect types, while the PSA module emphasizes critical spatial information to enhance defect recognition, particularly for small-scale defects in complex industrial backgrounds.
The HD-CF Fusion Block, built upon the C2fCIB module, the Dysample up-sampling module, and the HAM attention mechanism, unifies these extracted features through a multi-level fusion process. This integration minimizes information loss while maximizing feature diversity, ensuring that critical details of defects are preserved. Each component is specifically chosen for its unique contribution to the challenges of copper strip defect detection, such as handling subtle defect patterns, maintaining robustness in complex scenarios, and supporting real-time applications. This design ensures that the superior performance of the model is not merely the result of module stacking but stems from a deeply integrated and purpose-driven approach.
In brief, SC-AttentiveNet is a purpose-driven and deeply integrated network specifically designed for the unique challenges of surface defect detection on copper strip. The lightweight SCGNNet backbone, the complementary SPPF and PSA modules, and the HD-CF Fusion Block are not merely stacked components but are interconnected in a synergistic manner to address critical industrial requirements. Each module plays a distinct role in optimizing computational efficiency, enhancing feature extraction, and improving detection accuracy under complex scenarios. This integrated design ensures that SC-AttentiveNet achieves superior performance, balancing precision and real-time applicability in resource-constrained environments.
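For clarity, the following PyTorch-style skeleton sketches the three-part layout described above (backbone, neck, and anchor-free CenterNet-style head). The module names, channel widths, and head layout are illustrative placeholders rather than the exact implementation of SC-AttentiveNet.

```python
import torch.nn as nn

class SCAttentiveNetSketch(nn.Module):
    """Illustrative three-stage layout: backbone -> neck -> anchor-free head."""
    def __init__(self, backbone, neck, num_classes, head_channels=64):
        super().__init__()
        self.backbone = backbone  # e.g., an SCGNNet-like feature extractor
        self.neck = neck          # e.g., SPPF-PSA + HD-CF fusion blocks
        # CenterNet-style head: center heatmap, box size, and center offset.
        self.heatmap_head = nn.Conv2d(head_channels, num_classes, kernel_size=1)
        self.size_head = nn.Conv2d(head_channels, 2, kernel_size=1)
        self.offset_head = nn.Conv2d(head_channels, 2, kernel_size=1)

    def forward(self, x):
        feats = self.backbone(x)   # multi-scale features
        fused = self.neck(feats)   # fused feature map fed to the head
        return {
            "heatmap": self.heatmap_head(fused).sigmoid(),
            "size": self.size_head(fused),
            "offset": self.offset_head(fused),
        }
```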
3.3. SPPF-PSA Module
In this paper, we integrate the fast spatial pyramid pooling (SPPF) structure with the partial self-attention (PSA) module to leverage their complementary strengths and address the specific challenges of detecting surface defects on copper strips. These challenges include significant variability in defect size, shape, and appearance, along with the demand for accurate and efficient detection in high-resolution images.
The SPPF structure is embedded after the last layer of the backbone and is designed to extract richer global and multi-scale feature information. It generates a series of fixed-length feature vectors by applying successive max-pooling operations to the input feature map, whose effective receptive fields correspond to 5 × 5, 9 × 9, and 13 × 13 pooling kernels. These feature vectors contain spatial information at multiple scales, which greatly enhances the model’s ability to detect targets of different sizes. In addition, the SPPF structure effectively addresses the challenge of scale variation by enlarging the receptive field, reducing the number of network parameters, and improving the model’s robustness to spatial variations of the target. The SPPF structure is shown in
Figure 5a.
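For reference, a minimal PyTorch-style sketch of an SPPF block is given below; it chains three 5 × 5 max-pooling operations so that their combined receptive fields correspond to the 5 × 5, 9 × 9, and 13 × 13 kernels mentioned above. Channel widths are illustrative.

```python
import torch
import torch.nn as nn

class SPPF(nn.Module):
    """Fast spatial pyramid pooling: serial 5x5 max pools emulate
    5x5 / 9x9 / 13x13 pooling, followed by 1x1 channel fusion."""
    def __init__(self, in_channels, out_channels, pool_size=5):
        super().__init__()
        hidden = in_channels // 2
        self.reduce = nn.Conv2d(in_channels, hidden, kernel_size=1)
        self.pool = nn.MaxPool2d(kernel_size=pool_size, stride=1,
                                 padding=pool_size // 2)
        self.fuse = nn.Conv2d(hidden * 4, out_channels, kernel_size=1)

    def forward(self, x):
        x = self.reduce(x)
        p1 = self.pool(x)    # effective 5x5 receptive field
        p2 = self.pool(p1)   # effective 9x9 receptive field
        p3 = self.pool(p2)   # effective 13x13 receptive field
        return self.fuse(torch.cat([x, p1, p2, p3], dim=1))
```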
On the other hand, the introduced PSA [27] module addresses the high computational complexity and memory requirements of the self-attention mechanism by optimizing its computational efficiency. The input features are first passed through a 1 × 1 convolution and then split evenly into two parts along the channel dimension; only one part is fed into a stack of PSA blocks consisting of multi-head self-attention (MHSA) and a feed-forward network (FFN). This ensures that the feature information is fully utilized, while feature fusion is achieved through concatenation followed by another 1 × 1 convolution. In addition, the module sets the dimensions of the queries and keys in MHSA to half of that of the values and replaces LayerNorm with BatchNorm for faster inference. Therefore, applying the PSA module after the lowest-resolution feature map effectively reduces the computational overhead. The structure of the PSA module is shown in
Figure 5b.
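The following simplified sketch illustrates the partial self-attention idea described above: a 1 × 1 convolution, an even channel split, MHSA and FFN applied to one half only, and fusion of the two halves by concatenation and another 1 × 1 convolution. It is a simplified reading rather than the exact YOLOv10 implementation.

```python
import torch
import torch.nn as nn

class PSASketch(nn.Module):
    """Partial self-attention: apply MHSA + FFN to only half of the channels."""
    def __init__(self, channels, num_heads=4):
        super().__init__()
        half = channels // 2
        self.pre = nn.Conv2d(channels, channels, kernel_size=1)
        self.attn = nn.MultiheadAttention(half, num_heads, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Conv2d(half, half * 2, kernel_size=1), nn.ReLU(inplace=True),
            nn.Conv2d(half * 2, half, kernel_size=1),
        )
        self.post = nn.Conv2d(channels, channels, kernel_size=1)

    def forward(self, x):
        b, c, h, w = x.shape
        a, p = self.pre(x).chunk(2, dim=1)        # split along the channel dim
        tokens = p.flatten(2).transpose(1, 2)     # (B, HW, C/2)
        attn_out, _ = self.attn(tokens, tokens, tokens)
        p = p + attn_out.transpose(1, 2).reshape(b, c // 2, h, w)
        p = p + self.ffn(p)                       # feed-forward refinement
        return self.post(torch.cat([a, p], dim=1))
```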
By integrating the SPPF and PSA modules, our model achieves complementary benefits: the SPPF structure excels at extracting global and multi-scale features to detect defects of varying sizes, while the PSA module enhances feature representation and computational efficiency. This tailored combination is particularly effective for the unique characteristics of copper strip defect detection, enabling improved accuracy and robustness with minimal computational cost. The integrated SPPF-PSA structure is shown in
Figure 5c.
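Using the two sketches above, the integrated structure can be composed sequentially on the lowest-resolution feature map (channel width and spatial size assumed for illustration):

```python
import torch
import torch.nn as nn

# Illustrative composition of the SPPF and PSASketch classes defined above.
sppf_psa = nn.Sequential(SPPF(256, 256), PSASketch(256))
lowest_res_feat = torch.randn(1, 256, 20, 20)  # lowest-resolution backbone output
enhanced = sppf_psa(lowest_res_feat)           # shape preserved: (1, 256, 20, 20)
```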
3.4. HD-CF Fusion Block
The HD-CF Fusion Block is specifically designed to tackle the unique challenges of copper strip defect detection, including the need for effective multi-scale feature extraction, precise attention mechanisms, and artifact-free up-sampling in high-resolution images. By integrating the C2fCIB module, HAM hybrid attention mechanism, and Dysample up-sampling operator, the block provides a comprehensive solution that enhances the overall detection performance. These components work together to efficiently fuse multi-level feature information, resulting in detailed and accurate feature representations. This design also improves feature transfer across network layers while significantly reducing computational redundancy, making it well suited for real-time industrial applications.
Each component of the HD-CF Fusion Block contributes uniquely to its performance. The C2fCIB module employs a rank-guided optimization strategy to enhance multi-scale feature aggregation, ensuring a balanced trade-off between capacity and efficiency. The HAM hybrid attention mechanism refines features in both channel and spatial dimensions, enabling the network to focus more effectively on defect-critical regions and complex textures. Meanwhile, the Dysample operator addresses common up-sampling challenges, such as blurring and artifacts, by using a dynamic point-sampling approach that reconstructs high-quality feature maps without introducing additional computational overhead. Together, these innovations enable the HD-CF Fusion Block to achieve superior defect detection with improved precision and efficiency.
In the neck part of the SC-AttentiveNet network, feature maps from the HD-CF Fusion Block are fused with those from the backbone network using the Add operation (a minimal sketch follows the list below) for the following reasons:
- (1)
Dimensional consistency: the Add operation maintains consistent feature map dimensions, ensuring smooth integration without the complexity of expanding feature map channels, as would occur with Concat.
- (2)
Computational efficiency: the Add operation does not increase the number of channels, keeping the model lightweight and reducing computational load, which is crucial for real-time defect detection in resource-constrained environments.
- (3)
Stability and performance: by preserving dimensional consistency, the Add operation keeps training stable and enables more effective fusion of features.
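For illustration, a minimal example (with hypothetical tensor shapes) contrasting the two fusion operations:

```python
import torch

neck_feat = torch.randn(1, 128, 40, 40)      # from the HD-CF Fusion Block
backbone_feat = torch.randn(1, 128, 40, 40)  # skip connection from the backbone

added = neck_feat + backbone_feat                        # Add: channels stay at 128
concatenated = torch.cat([neck_feat, backbone_feat], 1)  # Concat: channels grow to 256

print(added.shape)         # torch.Size([1, 128, 40, 40])
print(concatenated.shape)  # torch.Size([1, 256, 40, 40])
```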
This approach aligns with the goal of optimizing both performance and computational efficiency. Next, a detailed introduction of each module within the HD-CF Fusion Block will be provided.
3.4.1. C2fCIB Module
In the YOLOv8 network, the C2f module (CSP Bottleneck with 2 Convolutions) is one of the core components responsible for cross-stage feature aggregation. This module effectively reduces the computational cost and compresses the model while maintaining or enhancing its expressive power through the aggregation of multi-scale information.
However, the authors of YOLOv10 found through intrinsic rank analysis that the C2f module is deficient in balancing capacity and efficiency. To address this, they propose the rank-guided C2fCIB module, which aims to reduce the complexity of redundant stages by optimizing the architecture. First, they designed the compact inverted block (CIB), which combines low-cost depthwise convolution for spatial mixing with efficient pointwise convolution for channel mixing. They then propose a rank-guided block allocation strategy to improve efficiency while maintaining model performance. Specifically, for a given model, the stages are sorted by their intrinsic rank from low to high, and the basic blocks of the lowest-ranked stages are replaced with the CIB in turn. If performance is not degraded, the replacement proceeds to the next stage; otherwise, it stops. With this approach, an adaptive compact module design is realized across different stages and model sizes, improving efficiency without compromising performance. The C2fCIB module structure is shown in
Figure 6.
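As an illustration of the CIB pattern described above (depthwise convolutions for spatial mixing interleaved with pointwise convolutions for channel mixing), a minimal PyTorch-style sketch is given below; the layer ordering and expansion ratio are assumptions and may differ from the YOLOv10 implementation.

```python
import torch.nn as nn

def dw(c, k=3):
    """Depthwise convolution: cheap spatial mixing."""
    return nn.Conv2d(c, c, k, padding=k // 2, groups=c)

def pw(c_in, c_out):
    """Pointwise (1x1) convolution: channel mixing."""
    return nn.Conv2d(c_in, c_out, 1)

class CIBSketch(nn.Module):
    """Compact inverted block sketch: DW -> PW expand -> DW -> PW project -> DW,
    with a residual shortcut when the shapes match."""
    def __init__(self, channels, expansion=2):
        super().__init__()
        hidden = channels * expansion
        self.block = nn.Sequential(
            dw(channels), pw(channels, hidden), dw(hidden),
            pw(hidden, channels), dw(channels),
        )

    def forward(self, x):
        return x + self.block(x)  # residual shortcut
```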
3.4.2. Hybrid Attention Mechanism
The structure of the HAM [30] hybrid attention module is shown in Figure 7. It consists of two sub-modules arranged sequentially: the channel attention module (CAM) and the spatial attention module (SAM). The CAM generates a one-dimensional attention map, while the SAM generates a pair of two-dimensional attention maps. This module refines the input feature maps in both the channel and spatial dimensions, and it can be embedded into any existing convolutional neural network to enhance its feature representation capability.
Assuming an input feature map F with dimensions H × W × C, the first step of the HAM module is to generate a one-dimensional channel attention tensor A_C, which is then multiplied with the input features to obtain the channel-refined feature F′. Next, the spatial attention submodule splits F′ into two groups, F′_1 and F′_2, along the channel dimension and computes the corresponding two-dimensional spatial attention tensors A_S,1 and A_S,2 for F′_1 and F′_2. Subsequently, these two attention tensors are multiplied by F′_1 and F′_2 to produce the spatial refinement features F″_1 and F″_2, respectively. Finally, these two features are summed to obtain the final refined feature F″. The above process can be expressed as

F′ = A_C ⊗ F,
F″_1 = A_S,1 ⊗ F′_1,  F″_2 = A_S,2 ⊗ F′_2,
F″ = F″_1 + F″_2,

where ⊗ denotes element-wise multiplication (with broadcasting).
The structure of the CAM module is shown in
Figure 8.
First, the module aggregates the spatial dimension information through average pooling and maximum pooling operations to generate the average-pooled feature F_Cavg and the max-pooled feature F_Cmax, respectively. Subsequently, these two tensors are fed into the adaptive mechanism module to obtain the enriched feature F_Cadd, which is computed as in Equation (8):

F_Cadd = α · F_Cavg + β · F_Cmax. (8)

The adaptive mechanism module contains two trainable floating-point parameters, α and β, whose values lie between 0 and 1. This module not only introduces an adaptive mechanism between the average-pooled and max-pooled features but also effectively enriches the feature information during image feature extraction.
Designs that use a multilayer perceptron (MLP) to compute channel attention not only make the model more complex but also suffer from channel dimensionality reduction. Therefore, the CAM module uses a fast one-dimensional convolution to capture the interactions between channels. The size k of this convolution kernel is adaptively determined by the number of channels C, as described by Equation (9):

k = | log2(C)/γ + b/γ |_odd, (9)

where γ and b denote hyperparameters, and |·|_odd indicates that the result is rounded to the nearest odd number, which is taken as the value of k.
Finally, the feature map F_Cadd output by the adaptive mechanism module is passed through the one-dimensional convolution, and the resulting feature tensor is activated with the sigmoid function. In short, the computational process of the CAM module can be summarized by Equation (10):

A_C = σ(C1D_1×k(F_Cadd)), (10)

where σ denotes the sigmoid function and C1D_1×k denotes a one-dimensional convolution with kernel size k.
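A minimal sketch of the CAM computation described above is given below: the average- and max-pooled channel descriptors are adaptively combined with learnable weights α and β, passed through a one-dimensional convolution with adaptive kernel size k, and gated by a sigmoid. Treating α and β as unconstrained scalars (rather than values constrained to [0, 1]) is a simplifying assumption.

```python
import math
import torch
import torch.nn as nn

class CAMSketch(nn.Module):
    """Channel attention: adaptively weighted avg/max descriptors,
    1D convolution across channels, sigmoid gating."""
    def __init__(self, channels, gamma=2, b=1):
        super().__init__()
        # Adaptive kernel size: nearest odd value of log2(C)/gamma + b/gamma.
        k = int(abs(math.log2(channels) / gamma + b / gamma))
        k = k if k % 2 else k + 1
        self.alpha = nn.Parameter(torch.tensor(0.5))  # weight for avg pooling
        self.beta = nn.Parameter(torch.tensor(0.5))   # weight for max pooling
        self.conv = nn.Conv1d(1, 1, kernel_size=k, padding=k // 2, bias=False)

    def forward(self, x):
        n, c, _, _ = x.shape
        f_avg = x.mean(dim=(2, 3))            # (B, C) average-pooled descriptor
        f_max = x.amax(dim=(2, 3))            # (B, C) max-pooled descriptor
        f_add = self.alpha * f_avg + self.beta * f_max
        attn = self.conv(f_add.unsqueeze(1))  # 1D conv across the channel axis
        a_c = torch.sigmoid(attn).view(n, c, 1, 1)
        return x * a_c                        # channel-refined feature F'
```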
The structure of the SAM module is shown in
Figure 9.
The channel-refined features F′ produced by the CAM via the channel attention tensor A_C are passed into the SAM module, which first performs a channel separation operation to obtain the important channel group F′_1 and the minor channel group F′_2. Then, average pooling and maximum pooling operations are performed on both F′_1 and F′_2 to summarize the channel-wise information, generating two pairs of 2D feature maps: F_S,1avg, F_S,1max and F_S,2avg, F_S,2max. Each pair is concatenated to generate a feature descriptor. Subsequently, the concatenated feature descriptors are convolved by a shared 7 × 7 convolutional layer to obtain a pair of 2D attention maps. Finally, the spatial attention maps A_S,1 and A_S,2 are generated by batch normalization, the ReLU function, and the sigmoid function. The above process can be mathematically represented as

A_S,1 = φ(C2D_7×7([F_S,1avg; F_S,1max])),
A_S,2 = φ(C2D_7×7([F_S,2avg; F_S,2max])),

where φ denotes a series of nonlinear operations (batch normalization, ReLU, and sigmoid), C2D_7×7 denotes a shared convolution with kernel size 7 × 7, and [·; ·] denotes concatenation.
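A corresponding sketch of the SAM branch follows: the channel-refined features are split into two groups, each group is summarized by channel-wise average and max pooling, and a shared 7 × 7 convolution followed by batch normalization, ReLU, and sigmoid produces the two spatial attention maps. An even channel split is assumed here, so the summed output has half as many channels as the input; the original module's channel bookkeeping may differ.

```python
import torch
import torch.nn as nn

class SAMSketch(nn.Module):
    """Spatial attention over two channel groups with a shared 7x7 convolution."""
    def __init__(self):
        super().__init__()
        self.shared_conv = nn.Conv2d(2, 1, kernel_size=7, padding=3, bias=False)
        self.bn = nn.BatchNorm2d(1)

    def _attention(self, f):
        # Summarize channel information with average and max pooling, then convolve.
        desc = torch.cat([f.mean(dim=1, keepdim=True),
                          f.amax(dim=1, keepdim=True)], dim=1)
        return torch.sigmoid(torch.relu(self.bn(self.shared_conv(desc))))

    def forward(self, f_prime):
        f1, f2 = f_prime.chunk(2, dim=1)       # important / minor channel groups
        f1_refined = f1 * self._attention(f1)  # A_S,1 * F'_1
        f2_refined = f2 * self._attention(f2)  # A_S,2 * F'_2
        # Sum of the two refined groups (half the input channels under this split).
        return f1_refined + f2_refined
```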
3.4.3. Dysample Up-Sampling Operator
Up-sampling is the process of increasing the resolution of low-resolution images or feature maps through specific methods in order to recover image details, improve image quality, and enhance the performance of deep learning models. In this paper, we introduce the Dysample [31] module to address common problems in the up-sampling process. Unlike traditional kernel-based dynamic up-sampling methods, the Dysample module formulates up-sampling as point sampling: it dynamically selects sampling points directly on the feature map instead of generating dynamic convolutional kernels to reassemble it. This approach significantly reduces the computational complexity and does not rely on high-resolution guidance features for reconstruction, which effectively reduces the loss of detail, image blurring, and the appearance of artifacts and jagged edges, making the up-sampled features more faithful and thus improving the overall performance of the deep learning model. The pseudo-code of the Dysample up-sampling operator is given in Algorithm A1.
The schematic diagram of the Dysample module is shown in
Figure 10.
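The core idea, predicting content-dependent sampling offsets and resampling the feature map by point sampling, can be sketched as follows. This is a simplified illustration using bilinear grid sampling, not the full Dysample operator (which additionally defines offset groups and initialization details).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DySampleSketch(nn.Module):
    """Dynamic point-sampling upsampler: predict per-pixel offsets, then
    resample the low-resolution feature map with bilinear grid_sample."""
    def __init__(self, channels, scale=2):
        super().__init__()
        self.scale = scale
        # Predict (x, y) offsets for each up-sampled location.
        self.offset = nn.Conv2d(channels, 2 * scale * scale, kernel_size=1)

    def forward(self, x):
        b, c, h, w = x.shape
        out_h, out_w = h * self.scale, w * self.scale
        # Arrange offsets on the high-resolution grid; scale them to stay local.
        offsets = F.pixel_shuffle(self.offset(x), self.scale) * 0.25
        # Base sampling grid in normalized [-1, 1] coordinates.
        ys = torch.linspace(-1, 1, out_h, device=x.device)
        xs = torch.linspace(-1, 1, out_w, device=x.device)
        gy, gx = torch.meshgrid(ys, xs, indexing="ij")
        grid = torch.stack((gx, gy), dim=-1).unsqueeze(0).expand(b, -1, -1, -1)
        # Add the dynamic offsets (normalized) and sample the input feature map.
        grid = grid + offsets.permute(0, 2, 3, 1) * 2 / torch.tensor(
            [out_w, out_h], device=x.device)
        return F.grid_sample(x, grid, mode="bilinear", align_corners=True)
```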