1. Introduction
In industrial manufacturing, hot-rolled strip steel is an indispensable core material with extensive applications in the machinery, automotive, and energy industries. Owing to factors such as the manufacturing process, production equipment, and human error during production and transportation, hot-rolled strip steel may suffer from surface defects. Surface quality directly affects the functionality and safety of downstream products. These defects not only compromise the appearance of the products but, more importantly, can severely degrade the mechanical properties of the strip steel, such as fatigue strength, wear resistance, and corrosion resistance, leading to quality issues and safety hazards. Developing automated defect detection technology that improves detection accuracy and efficiency is therefore crucial for inspecting the surface of hot-rolled strip steel. Such technology supports the smart upgrading of the steel manufacturing sector, aligns with the demands of Industry 4.0, and is important for industrial safety and production [1].
Traditional manual inspection methods, relying on optical detection equipment, semi-automatically identify the intricate and varied surface imperfections of hot-rolled strip steel. However, manual inspection is labor-intensive, subjective, and experience-dependent, and it suffers from low efficiency and high missed-detection rates. These shortcomings may lead to product quality issues, safety hazards, and additional costs [2]. Although manual inspection played a crucial role in early industrial production, its limitations make automated detection an inevitable trend. Machine learning-based detection methods, such as those built on SIFT, HOG, and SVM, offer advantages over traditional image processing techniques. Unlike the traditional methods, machine learning can learn features in a data-driven manner, allowing for more accurate and adaptive defect detection [3]. Although machine learning techniques have shown progress in detecting defects, their limited feature representation capabilities make it difficult for them to adapt to intricate defect patterns in real-world working environments. Recent advancements in deep learning have greatly improved defect identification on the surface of hot-rolled strip steel. Deep learning-based object detection methods have been widely applied to surface defect detection, becoming an important research direction for industrial quality inspection automation [4]. Therefore, this study employs object detection methods to address the problem of surface defect detection in hot-rolled strip steel. Mainstream object detection algorithms can be divided into two main categories: the two-stage approach, represented by Faster R-CNN [5], and the one-stage approach, represented by YOLO [6] and SSD [7]. Two-stage algorithms offer higher detection accuracy but have slower inference speeds, which makes it difficult to balance detection performance and inference efficiency. One-stage object detection algorithms treat the task as a regression problem, combining localization and classification, and directly output detection results. Compared with two-stage models, one-stage models have simpler architectures and detect more efficiently.
In response to the operational demands of defect detection, many researchers have proposed improvement strategies based on object detection. Zhong et al. [8] proposed a multi-stage method based on pooling attention mechanisms and cross-scale shallow feature enhancement, which addressed the challenge of accurately identifying and locating defects in low-contrast and cluttered background environments. Han et al. [9] proposed a visual defect detection method for mechanical parts based on deep visual sensing technology, which addressed the inefficiency of traditional manual inspection methods and their inability to meet the demands of modern manufacturing. Guo et al. [10] proposed a crack detection model that integrated the Discrete Wavelet Transform and deep learning, which addressed the challenge of detecting cracks in dams within complex underwater environments. Zhang et al. [11] proposed an improved gas pipeline defect detection algorithm, which addressed the problem of accurately and quickly detecting defects in natural gas pipelines. Wan et al. [12] proposed an approach that integrated multiple information fusion strategies to enhance the YOLOv8 network, addressing the additional challenges caused by using grayscale images for object detection. Zhao et al. [13] proposed a multi-scale adaptive fusion defect detection algorithm specifically designed for complex backgrounds, which addressed the issue of deploying high-precision detection models on resource-limited edge devices. Xia et al. [14] proposed a dual-level efficient global algorithm, which addressed the issue of detecting small surface defects in steel from scarce information with sparse features. Cui et al. [15] proposed a task-aware attention network for weak surface defect detection, which addressed the feature conflict and spatial misalignment between the classification and localization heads that degrade defect detection performance. Xie et al. [16] proposed an efficient re-parameterized feature pyramid network detection method, which addressed the challenge of detecting complex surface defects in steel and enabled efficient real-time detection of steel surface defects. Chu et al. [17] proposed a lightweight hot-rolled strip steel surface defect detection network based on an improved YOLOv8, which addressed the low detection accuracy and long detection time caused by variations in defect size and image blurriness during the acquisition process. Wang et al. [18] proposed a novel model for metal surface defect detection, which addressed common issues such as low detection accuracy, high missed-detection rates, and high false-detection rates. Zhong et al. [19] proposed a lightweight network to optimize feature selection for defect recognition, addressing the inability to select the most beneficial features during feature extraction and the loss of key feature information during gradient sampling. Liang et al. [20] proposed a lightweight network based on an attention mechanism, which included a deformable convolution feature extraction module and a stepwise attention mechanism module. This network addressed the real-time defect detection problem under complex operating conditions.
Although the aforementioned methods have made some progress in specific application scenarios, they still have limitations. First, the existing methods suffer from an imbalance in multi-label extraction when handling defects with different labels, leading to a high missed-detection rate for small targets. Second, mainstream object detectors typically employ separate classification and localization branches, resulting in insufficient feature sharing. This makes them prone to missing minor defects and to localization errors in high-noise environments. Finally, while some high-performance models have improved detection accuracy, they come with increased computational costs, making it challenging to balance detection performance and real-time processing in industrial inspection environments.
To overcome the aforementioned challenges, this paper introduces a lightweight algorithm for surface defect detection in hot-rolled strip steel, referred to as CTL-YOLO, which is built upon the YOLO11n object detection framework. The primary contributions of this work are outlined as follows:
In the Neck section, a Context-Guided Reconstruction and Cascaded Cross Fusion Pyramid Network (CGRCCFPN) is proposed for the effective integration of features across multiple scales and the preservation of detailed information, enhancing small object detection performance;
In the Head section, a Task Variable Alignment Detection Head (TVADH) is proposed, utilizing shared convolutional layers throughout the entire process for parameter reuse, and obtaining joint features for both localization and classification, enhancing the synergy between localization and classification;
The LAMP channel pruning algorithm is used to further compress the model and improve its computational efficiency while maintaining detection accuracy.
The remainder of this paper introduces the model structure and principles of YOLO11, describes the proposed enhancements and the experiments conducted in line with the latest developments in this field, compares the performance of the proposed method against leading algorithms, and evaluates its effectiveness. Finally, we present a discussion and summary of our experimental findings.
2. Materials and Methods
2.1. Background Theory of YOLO11
The YOLO11 algorithm [21] was released by the Ultralytics team in September 2024. It introduces significant improvements and upgrades over previous YOLO versions. Its refined architecture and training process enable improved processing speeds while preserving accuracy, making YOLO11 well suited to object detection and other computer vision tasks. YOLO11 includes five models: YOLO11n, YOLO11s, YOLO11m, YOLO11l, and YOLO11x. The modules used in all five models are exactly the same; the models differ in depth, width, and maximum number of channels, resulting in varying computational complexity and parameter counts. The overall structure consists of the Backbone network, the Neck network, and the Head network. The Backbone mainly consists of Conv, C3k2, SPPF, and C2PSA modules, which are used for feature extraction from the input image. The Neck is primarily composed of Upsample, Concat, and C3k2, facilitating feature fusion between shallow and deep information. The Head consists mainly of Conv, DWConv, and Conv2d, which are responsible for object classification and localization prediction. The overall network framework of YOLO11 is very similar to that of YOLOv8, with the main difference being improvements in the underlying components. In the Backbone and Neck, some parts use the C3k2 module for feature selection, while a C2PSA module is incorporated following the SPPF module to enhance feature selection and processing. Additionally, depthwise separable convolutions are used in some branches of the Head to reduce redundant computations and improve efficiency. To balance detection accuracy against computational resources, this paper selects YOLO11n as the baseline network.
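To make the role of depth, width, and maximum channels concrete, the following minimal sketch shows how such multipliers could turn one shared layer definition into the five variants. The multiplier values in `SCALES` and the helper names are assumptions for illustration (the values are recalled from the public YOLO11 configuration and are not stated in this paper), so they should be checked against the released configuration before being relied on.

```python
import math

# Assumed (depth, width, max_channels) multipliers per variant. These values are
# not given in the paper; they are illustrative and should be verified.
SCALES = {
    "n": (0.50, 0.25, 1024),
    "s": (0.50, 0.50, 1024),
    "m": (0.50, 1.00, 512),
    "l": (1.00, 1.00, 512),
    "x": (1.00, 1.50, 512),
}


def make_divisible(value, divisor=8):
    """Round value up to the nearest multiple of divisor."""
    return int(math.ceil(value / divisor) * divisor)


def scaled_channels(base_channels, variant):
    """Channels of a layer after clipping to max_channels and applying the width multiplier."""
    _, width, max_channels = SCALES[variant]
    return make_divisible(min(base_channels, max_channels) * width)


def scaled_repeats(base_repeats, variant):
    """Repeat count of a block after applying the depth multiplier (never below 1)."""
    depth, _, _ = SCALES[variant]
    if base_repeats <= 1:
        return base_repeats
    return max(round(base_repeats * depth), 1)


if __name__ == "__main__":
    for name in SCALES:
        print(name, scaled_channels(1024, name), scaled_repeats(2, name))
```

Under these assumed multipliers, the same nominal 1024-channel layer becomes much narrower in YOLO11n than in the larger variants, which is where the differences in parameter count and computational cost come from.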
2.2. Developed Overall Network Architecture
To improve recognition efficiency in industrial defect detection scenarios, this paper presents CTL-YOLO, a resource-efficient detection method built on the YOLO11n algorithm. The proposed method aims to maximize efficiency in detecting hot-rolled strip steel surface defects on terminal devices with limited computational power. First, this study proposes CGRCCFPN, a network for global feature fusion across multiple scales that combines the Rectangular Self-Calibration Module [22], CaFormer [23], and CGLU [24]. Second, this study proposes TVADH, a Task Variable Alignment Detection Head designed with reference to the concepts of Group Normalization [25] and TOOD [26]. Finally, this study uses the LAMP [27] pruning algorithm to compress and simplify the model through layer-adaptive magnitude-based pruning. An overview of the improved network configuration is presented in Figure 1.
2.3. Context-Guided Reconstruction and Cascaded Cross Fusion Pyramid Network
In hot-rolled strip steel surface defect detection, the baseline feature fusion network fuses feature maps through simple upsampling and concatenation, and therefore lacks the ability to dynamically model the contextual information of defects at different scales. The fixed structure of convolutional stacks struggles to adapt to complex and varying defect morphology, and conventional pyramid structures tend to weaken shallow texture information during feature propagation. However, defect detection on hot-rolled strip steel surfaces relies heavily on pixel-level spatial features. To address this, this paper proposes a Neck network called the Context-Guided Reconstruction and Cascaded Cross Fusion Pyramid Network (CGRCCFPN). It aggregates global context through pyramid pooling, uses dynamic interpolation fusion to enhance multi-scale feature representation, and embeds a channel attention mechanism in the convolutions. This allows local feature enhancement and global semantic guidance to be optimized jointly, addressing key issues such as multi-scale feature imbalance and high missed-detection rates for small targets in hot-rolled strip steel surface defect detection.
The output feature maps P3, P4, and P5 from the Backbone network serve as the multi-scale inputs to CGRCCFPN. These feature maps undergo pyramid pooling aggregation in the Pyramid Context Extraction (PCE) module. Multi-scale pooling is applied to each level's feature map to generate global context features, which are then fused through concatenation and a 1 × 1 convolution, represented as follows:
$$P_k(F_l) = \mathrm{AdaptiveAvgPool}_k(F_l)$$
where $F_l$ represents the l-th level feature map output by the Backbone network (e.g., P3, P4, or P5), $k$ is the size of the pooling kernel (with values of 1 × 1, 3 × 3, or 5 × 5), and $P_k(F_l)$ denotes the feature map obtained after performing adaptive average pooling on $F_l$.
$$F_{ctx} = \mathrm{Conv}_{1 \times 1}\left(\mathrm{Concat}\left(\{P_k(F_l)\}_k\right)\right)$$
where Concat refers to concatenating the multi-scale pooling results along the channel dimension, $\mathrm{Conv}_{1 \times 1}$ represents a 1 × 1 convolution (employed for channel reduction and feature fusion), and $F_{ctx}$ refers to the globally fused contextual features. This effectively combines features from various layers, improving the model's contextual awareness. The Rectangular Self-Calibration Module (RCM) then reconstructs the features using a depthwise separable convolution together with spatial attention: a convolution generates the spatial attention weights, the Sigmoid activation function is applied to them, and element-wise multiplication (⊗) yields the enhanced regional contextual features. The RCM facilitates spatial feature reconstruction and the extraction of pyramid context. It captures global information along both the horizontal and vertical axes, while also acquiring axial context to effectively model the key rectangular regions. The workflow of the PCE and RCM modules is shown in Figure 2.
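The pyramid pooling step of the PCE module can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it works on a single-channel map stored as a list of rows, it interprets the 1 × 1, 3 × 3, and 5 × 5 values as output grid sizes of adaptive average pooling, and it stands in for channel concatenation by flattening the pooled maps into one context vector. The function names are hypothetical.

```python
import math


def adaptive_avg_pool2d(feature, out_size):
    """Average-pool a 2D map (list of rows) down to an out_size x out_size grid."""
    height, width = len(feature), len(feature[0])
    pooled = []
    for i in range(out_size):
        # Window boundaries follow the usual adaptive-pooling rule:
        # start = floor(i * H / out), end = ceil((i + 1) * H / out).
        r0 = (i * height) // out_size
        r1 = math.ceil((i + 1) * height / out_size)
        row = []
        for j in range(out_size):
            c0 = (j * width) // out_size
            c1 = math.ceil((j + 1) * width / out_size)
            window = [feature[r][c] for r in range(r0, r1) for c in range(c0, c1)]
            row.append(sum(window) / len(window))
        pooled.append(row)
    return pooled


def pyramid_context(feature, sizes=(1, 3, 5)):
    """Pool one feature map at several grid sizes and flatten the results together."""
    pooled_maps = [adaptive_avg_pool2d(feature, k) for k in sizes]
    context = [v for pooled in pooled_maps for row in pooled for v in row]
    return pooled_maps, context


if __name__ == "__main__":
    demo = [[r * 6 + c for c in range(6)] for r in range(6)]
    maps, ctx = pyramid_context(demo)
    print(maps[0], len(ctx))
```

The coarsest grid summarizes the whole map in a single value, while the finer grids keep progressively more spatial layout, which is the sense in which the pooled features carry global context at several scales.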
The Get Index Output (GIO) module separates the fused features into three hierarchical features, P3/8, P4/16, and P5/32; this layering makes the data flow more intuitive and simplifies implementation. The Fuse Block Multi (FBM) module performs a gated fusion of the enhanced features from P5, P4, and P3 with the pyramid-extracted contextual features from the same levels. It takes high-level and low-level information as inputs, transforms each with its own convolution kernel, and combines them using the Sigmoid activation function, bilinear interpolation upsampling (UpSample), and element-wise multiplication (⊗) to produce the fusion result. The Dynamic Interpolation Fusion (DIF) module adaptively adjusts the high-level semantic information and low-level detail information using a learnable upsampling kernel, a scaling factor, and element-wise addition (⊕). Together, FBM and DIF gather multi-scale information; dynamic interpolation strengthens the model's ability to represent multi-scale features and to identify targets in complex backgrounds. The workflow is illustrated in Figure 3.
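One plausible reading of the gated fusion performed by FBM can be sketched as below. The exact composition is an assumption: here the high-level map is passed through a Sigmoid to form a gate, the gate is upsampled bilinearly to the low-level resolution, and the result multiplies the low-level map element-wise. The scalar weights `w_low` and `w_high` are hypothetical stand-ins for the two convolution kernels, and the maps are single-channel lists of rows.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def bilinear_upsample(feature, out_h, out_w):
    """Bilinear resize of a 2D map (list of rows), aligning the corner samples."""
    in_h, in_w = len(feature), len(feature[0])
    out = []
    for i in range(out_h):
        y = i * (in_h - 1) / (out_h - 1) if out_h > 1 else 0.0
        y0 = int(math.floor(y))
        y1 = min(y0 + 1, in_h - 1)
        dy = y - y0
        row = []
        for j in range(out_w):
            x = j * (in_w - 1) / (out_w - 1) if out_w > 1 else 0.0
            x0 = int(math.floor(x))
            x1 = min(x0 + 1, in_w - 1)
            dx = x - x0
            top = feature[y0][x0] * (1 - dx) + feature[y0][x1] * dx
            bottom = feature[y1][x0] * (1 - dx) + feature[y1][x1] * dx
            row.append(top * (1 - dy) + bottom * dy)
        out.append(row)
    return out


def gated_fusion(low, high, w_low=1.0, w_high=1.0):
    """Gate the low-level map with an upsampled Sigmoid of the high-level map."""
    # Scalar weights stand in for the two convolution kernels (an assumption).
    gate_small = [[sigmoid(w_high * v) for v in row] for row in high]
    gate = bilinear_upsample(gate_small, len(low), len(low[0]))
    return [[w_low * l * g for l, g in zip(low_row, gate_row)]
            for low_row, gate_row in zip(low, gate)]


if __name__ == "__main__":
    low_map = [[1.0] * 4 for _ in range(4)]
    high_map = [[-4.0, 4.0], [-4.0, 4.0]]
    for fused_row in gated_fusion(low_map, high_map):
        print([round(v, 3) for v in fused_row])
```

In this sketch the gate lies between 0 and 1, so regions that the high-level features consider irrelevant are attenuated in the low-level map while detail in relevant regions passes through.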
The input is first processed through the C3k2 branch, which performs multi-scale convolution decomposition and completes the feature reorganization stage. CaFormer then provides local–global attention interaction for context aggregation, and CGLU performs feature selection through a gated nonlinear transformation. The C3k2-CaFormer-CGLU (C3k2_CFC) module thus combines convolutional attention with gating mechanisms. Its block operates on the input feature map using Layer Normalization (LayerNorm), a TokenMixer, the Convolutional Gated Linear Unit (CGLU), element-wise multiplication (⊗), and DropPath, the Stochastic Depth technique. The TokenMixer is the rectangular region attention mechanism based on CaFormer: the Query, Key, and Value matrices are obtained from the input through learnable projection matrices, and the attention is computed with a scaling factor determined by the feature dimension and a rectangular mask applied by element-wise multiplication. The C3k2_CFC module enhances feature discriminability. It is obtained by replacing the MetaFormer block of C3k2 with the CFCBlock developed in this work. To reduce computational and parameter complexity, the C3k2_CFC module is placed at locations with smaller feature map sizes, where it outputs the features required by the final detection head. The workflow is illustrated in Figure 4.
In summary, CGRCCFPN achieves efficient multi-scale information gathering and detail preservation in hot-rolled strip steel surface defect detection through context-guided feature reconstruction and a lightweight attention mechanism. It adaptively adjusts the semantic weights of defects at different scales, enabling dynamic context awareness and enhanced feature discriminability. Testing on the NEU-DET dataset demonstrates that the algorithm outperforms traditional Neck structures.
2.4. Task Variable Alignment Detection Head
Traditional detection heads adopt a parallel classification and regression branch design, and their feature fusion strategy lacks task-oriented guidance, which suppresses the discriminability of multi-scale defects. Fixed convolution kernels are unable to adapt dynamically to spatial shifts in intricately textured backgrounds, resulting in a decrease in localization accuracy. In the hot-rolled strip steel surface defect detection scenario, defects often exhibit varying scales, irregular shapes, and low contrast. To address this, this paper proposes a Head network called the Task Variable Alignment Detection Head (TVADH), as shown in Figure 5.
Two lightweight Conv_GN layers (Group Normalization + Depthwise Separable Convolution) extract basic features and set the output feature dimensions. This reduces the feature distribution shift caused by uneven lighting in industrial scenes, decreases noise interference in industrial images, and lowers computational complexity, thus meeting real-time detection requirements. Multi-scale context information is preserved through this shared feature extraction. The Task Decomposition module reorganizes features through channel attention, allowing the classification branch to focus on defect semantic information while the regression branch learns spatial geometric features, achieving dual-path task decoupling and decomposition, as expressed by the following:
$$F_{cls} = \mathrm{CLSDecomp}(F, \mathrm{AvgPool}(F)), \quad F_{reg} = \mathrm{REGDecomp}(F, \mathrm{AvgPool}(F))$$
where $F$ represents the feature map, AvgPool refers to average pooling, CLSDecomp is classification decomposition, and REGDecomp is regression decomposition. In the regression branch, the mask-and-offset generator uses a 3 × 3 convolution to predict the offsets and modulation factors. DCNv2 [28] refers to the dynamic deformable convolution, which performs adaptive deformation feature alignment, as expressed by the following:
$$y(p) = \sum_{k} w_k \cdot x(p + p_k + \Delta p_k) \cdot \Delta m_k$$
where $p$ represents the pixel coordinates on the feature map, $p_k$ is the offset of the k-th sampling point relative to the center coordinate, $\Delta p_k$ is the predicted offset, $\Delta m_k$ is the spatial modulation factor, $w_k$ is the fixed convolution kernel weight, $x$ is the input feature map, and $y$ is the aligned regression feature map. The regression branch enables the convolution kernel to deform adaptively to the local irregularities of hot-rolled strip steel defect edges, improving the localization accuracy of small target defects. In the classification branch, a Conv-ReLU-Conv-Sigmoid sequence uses two 1 × 1 convolutions to compress the channels, and a 3 × 3 convolution generates the spatial probability map. The Multiply operation re-weights the features, with pixel-wise multiplication suppressing background noise. The spatial probability map highlights the defect regions and suppresses false positives caused by the surface texture. Finally, the independent channel attention modules, Conv_reg and Conv_cls, are used to decouple classification and regression features, alleviating task conflicts.
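The alignment step in the regression branch can be illustrated with a minimal sketch of modulated deformable sampling at a single output location, following the general DCNv2 formulation in which each of the nine taps of a 3 × 3 kernel is shifted by a predicted offset and scaled by a modulation factor. The sketch assumes a single-channel map stored as a list of rows and zero padding outside the map; the function names and the way offsets and modulation factors are passed in are hypothetical, since in the network they would be predicted by the 3 × 3 generator convolution.

```python
import math


def bilinear_sample(feature, y, x):
    """Sample a 2D map at fractional (y, x); positions outside the map read as 0."""
    h, w = len(feature), len(feature[0])
    y0, x0 = math.floor(y), math.floor(x)
    value = 0.0
    for yy, wy in ((y0, 1 - (y - y0)), (y0 + 1, y - y0)):
        for xx, wx in ((x0, 1 - (x - x0)), (x0 + 1, x - x0)):
            if 0 <= yy < h and 0 <= xx < w:
                value += wy * wx * feature[yy][xx]
    return value


# Regular 3 x 3 grid: fixed offsets of each tap relative to the centre.
GRID = [(dy, dx) for dy in (-1, 0, 1) for dx in (-1, 0, 1)]


def modulated_deform_conv_at(feature, center, weights, offsets, masks):
    """Output of one modulated deformable 3 x 3 convolution at a single location.

    weights: 9 kernel weights; offsets: 9 predicted (dy, dx) shifts;
    masks: 9 modulation factors, typically in [0, 1].
    """
    cy, cx = center
    total = 0.0
    for (gy, gx), w_k, (oy, ox), m_k in zip(GRID, weights, offsets, masks):
        total += w_k * m_k * bilinear_sample(feature, cy + gy + oy, cx + gx + ox)
    return total


if __name__ == "__main__":
    fmap = [[float(r * 5 + c) for c in range(5)] for r in range(5)]
    mean_kernel = [1.0 / 9.0] * 9
    no_shift = [(0.0, 0.0)] * 9
    print(modulated_deform_conv_at(fmap, (2, 2), mean_kernel, no_shift, [1.0] * 9))
```

With zero offsets and unit modulation this reduces to an ordinary 3 × 3 convolution; non-zero offsets move the sampling points off the regular grid, which is what lets the kernel follow irregular defect edges.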
In summary, TVADH predicts the convolution kernel offsets based on feature content, and the adaptive adjustment of convolution sampling points enhances localization robustness for irregular defects. The shared convolution layer parameters are reused throughout the entire process, considerably lowering the parameter count compared to traditional detection heads. On the NEU-DET dataset, it performs well in comparison with conventional Head structures.
2.5. Layer-Adaptive Magnitude-Based Pruning
Deep neural networks commonly suffer from parameter redundancy and inefficient feature representation, which severely restricts the deployment performance of lightweight detection algorithms on terminals with limited computing power. To address this, this study introduces the Layer-Adaptive Magnitude-Based Pruning (LAMP) [27] algorithm to optimize the proposed algorithm. LAMP establishes an adaptive inter-layer importance evaluation mechanism and uses dynamic thresholds to protect shallow, fine-grained information. This allows the model parameters to be pruned precisely while preserving the ability to express key features, thereby balancing computational efficiency and detection accuracy. Here, pruning refers to removing redundant parameters from the structurally improved YOLO model with the LAMP algorithm.
The channel pruning strategy of LAMP is based on the core principle of achieving adaptive inter-layer sparsity allocation through weight magnitude analysis, rather than pruning the entire network uniformly. For a feedforward neural network with depth $g$, given the weight tensors of each layer $W^{(1)}, \ldots, W^{(g)}$, the algorithm first flattens the weights of each layer into one-dimensional vectors and arranges them in ascending order of magnitude, ensuring that for indices $u < v$, $|W[u]| \le |W[v]|$. The LAMP score for each weight $W[u]$ is defined as follows:
$$\mathrm{score}(u; W) = \frac{(W[u])^2}{\sum_{v \ge u} (W[v])^2}$$
where $u$ and $v$ are index values of the weights, and $W[u]$ and $W[v]$ represent the weight elements corresponding to indices $u$ and $v$. This score establishes a relative evaluation system for weight importance through normalization. The numerator term $(W[u])^2$ characterizes the local importance of the target weight, while the denominator term $\sum_{v \ge u} (W[v])^2$ sums the squared magnitudes of all weights in the layer that are at least as large, placing the target weight in the context of the whole layer. Within a layer, the score preserves the ordering of the magnitudes, as expressed by the following formula:
$$|W[u]| > |W[v]| \Rightarrow \mathrm{score}(u; W) > \mathrm{score}(v; W)$$
That is, a larger weight magnitude corresponds to a higher LAMP score, which provides a theoretical basis for selecting the pruning threshold. By dynamically adjusting the sparsity thresholds of each layer, the algorithm systematically removes low-scoring connections until the desired global compression rate is achieved. Finally, the pruned YOLO model usually requires fine-tuning to recover, or improve on, its pre-pruning performance.
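The scoring and global selection described above can be sketched in a few lines of plain Python. The sketch assumes the standard LAMP score, in which each squared weight is divided by the sum of the squared weights in the same layer that are at least as large in magnitude. It operates on individual weights of flattened layers; how scores are grouped into channels for channel pruning is not shown, and the function names are hypothetical.

```python
def lamp_scores(weights):
    """LAMP score of every weight in one flattened layer, returned in input order."""
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    scores = [0.0] * len(weights)
    tail_sum = 0.0
    # Walk from the largest magnitude down so that tail_sum always holds the
    # sum of squares of all weights at least as large as the current one.
    for i in reversed(order):
        tail_sum += weights[i] ** 2
        scores[i] = weights[i] ** 2 / tail_sum if tail_sum > 0 else 0.0
    return scores


def lamp_prune(layers, sparsity):
    """Zero the globally lowest-scoring fraction of weights across all layers."""
    scored = [
        (score, layer_idx, weight_idx)
        for layer_idx, layer in enumerate(layers)
        for weight_idx, score in enumerate(lamp_scores(layer))
    ]
    scored.sort()
    n_prune = int(sparsity * len(scored))
    pruned = [list(layer) for layer in layers]
    for _, layer_idx, weight_idx in scored[:n_prune]:
        pruned[layer_idx][weight_idx] = 0.0
    return pruned


if __name__ == "__main__":
    layers = [[3.0, 1.0, 2.0], [0.1, 0.2]]
    print(lamp_scores(layers[0]))
    print(lamp_prune(layers, 0.4))
```

Because the largest weight of every layer always receives a score of 1, a layer whose weights are uniformly small is not wiped out by a single global threshold. In the example, plain global magnitude pruning at 40% would remove both weights of the second layer, whereas the LAMP ranking removes one weight from each layer.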
In summary, the LAMP algorithm introduced in this study adaptively evaluates weight importance and protects key feature layers through a dynamic threshold mechanism. It significantly reduces the complexity and parameter count of the model, addressing the model lightweighting challenge in industrial detection scenarios. Testing on the NEU-DET dataset demonstrates that its performance meets the dual requirements of efficient and precise defect detection for hot-rolled production lines.
5. Conclusions
This study proposed the CTL-YOLO detection algorithm and evaluated it on the NEU-DET and GC10-DET datasets to explore the balance between computational efficiency and detection accuracy in hot-rolled strip steel surface defect detection. This research supplements the existing work, addresses gaps in current studies, and provides a reference for the steel industry. First, we explored the relationships among the Neck structure of YOLO11, multi-scale information fusion, and small target recognition. We proposed the CGRCCFPN feature fusion network, which achieved a more efficient fusion of multi-scale information as well as better preservation of details. Next, we investigated the strategy of information sharing between the classification and regression branches in the Head structure. We introduced the TVADH network, which reuses shared convolutional layer parameters between classification and regression, enhancing defect detection in complex backgrounds. Lastly, we applied the LAMP algorithm to prune redundant parameters of the YOLO model, decreasing the model's computational cost and parameter count and ensuring efficient operation even on devices with limited computing power. The experimental results showed that on the NEU-DET and GC10-DET datasets, CTL-YOLO achieved mAP50 values of 77.6% and 73.6%, respectively, improving by 3.2 and 5.4 percentage points compared to YOLO11n. The GFLOPs were reduced to 2.0 and 4.1, a decrease of 68.3% and 34.9%, respectively. Params were reduced to 0.40 M and 0.94 M, a reduction of 84.5% and 63.6%, respectively, significantly outperforming the other tested models, especially in balancing computational efficiency and detection accuracy. The model size was reduced to 1.2 MB, meeting industrial embedded deployment standards. This study has some limitations; for example, the FPS is below that of the baseline algorithm. However, it still meets the real-time identification needs of industrial applications.
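The reported reductions can be cross-checked with simple arithmetic: a value reduced by a given percentage implies a baseline equal to the reduced value divided by one minus that percentage. The short sketch below recovers the baseline implied by each reported pair. The numbers are taken from the paragraph above; the helper names are hypothetical, and the implied baselines are derived quantities rather than figures stated in the text.

```python
def implied_baseline(reduced_value, percent_decrease):
    """Baseline implied by a reduced value and its stated percentage decrease."""
    return reduced_value / (1.0 - percent_decrease / 100.0)


# (value after pruning, stated percentage decrease), as reported in the text.
REPORTED = {
    "NEU-DET": {"gflops": (2.0, 68.3), "params_m": (0.40, 84.5)},
    "GC10-DET": {"gflops": (4.1, 34.9), "params_m": (0.94, 63.6)},
}


def implied_baselines(reported):
    """Implied baseline for every dataset and metric in the reported table."""
    return {
        dataset: {name: implied_baseline(value, pct) for name, (value, pct) in metrics.items()}
        for dataset, metrics in reported.items()
    }


if __name__ == "__main__":
    for dataset, metrics in implied_baselines(REPORTED).items():
        print(dataset, {name: round(value, 2) for name, value in metrics.items()})
```

Both datasets imply nearly the same baseline, roughly 6.3 GFLOPs and 2.58 M parameters, which is consistent with the two sets of percentages being measured against a single YOLO11n baseline.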
In the future, we plan to combine knowledge distillation and quantization techniques to further compress the model size while ensuring its inference ability and generalization capability. Additionally, we will explore the applicability of different pruning methods on various neural network architectures to optimize deep learning models for more practical application scenarios.