The improved model proposed in this study is based on the YOLOv5 algorithm, as depicted in
Figure 1. YOLOv5 is released in four versions: YOLOv5s, YOLOv5m, YOLOv5l, and YOLOv5x, which differ in network depth, width, and computational complexity. YOLOv5s is the smallest and most lightweight, making it suitable for resource-constrained scenarios, while YOLOv5x is the deepest and most accurate, rendering it suitable for applications requiring high precision.
In response to the lightweight and real-time requirements of industrial defect detection, YOLOv5s was improved in this study while maintaining detection accuracy. The improvements focus on optimizing the backbone, neck, and head layers to enhance detection speed and accuracy. In particular, for defect detection on small targets and in complex environments, the model structure and parameters were refined, yielding improved performance relative to the original model.
2.1. Para_CBAM Attention Mechanism Design
CBAM (Convolutional Block Attention Module) is a structurally compact and efficient attention module that attends to both the channel and spatial information in images. It achieves this by sequentially integrating two sub-modules: channel attention and spatial attention. This allows the network to first identify important feature channels and then focus on important spatial regions within those channels. This fine-grained attention adjustment strategy enables CBAM to achieve significant performance improvements across various visual tasks.
The overall structure of CBAM is illustrated in
Figure 2. Initially, global average pooling and global max pooling are utilized to extract global spatial information, and these pooled descriptors are used to learn an importance weight for each channel. This step ensures that the model can identify and emphasize the feature channels most critical for the current task. CBAM then introduces the spatial attention module, which further highlights important spatial regions by considering the information at each spatial position of the feature map: channel-wise global max and average pooling are performed on the feature map, followed by a convolutional layer, which ultimately generates a spatial attention map. This map guides the model’s focus toward the crucial parts of the image.
The feature $F_{avg}^{c}$ obtained from global average pooling and the feature $F_{max}^{c}$ obtained from global max pooling are each processed through a fully connected layer with shared weights. Afterward, they are summed and passed through a Sigmoid activation function to generate the final channel attention map, denoted as $M_{c}(F)$, as shown in Equation (1):

$$M_{c}(F) = \sigma\left(\mathrm{MLP}\left(\mathrm{AvgPool}(F)\right) + \mathrm{MLP}\left(\mathrm{MaxPool}(F)\right)\right) \tag{1}$$

In the above equation:
$\sigma$—represents the Sigmoid activation function;
$\mathrm{MLP}$—denotes a multilayer perceptron (MLP) that includes a ReLU activation function;
$\mathrm{MaxPool}$—signifies the max pooling operation;
$\mathrm{AvgPool}$—indicates the average pooling operation.
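For reference, the channel attention branch of Equation (1) can be written as a minimal PyTorch sketch. Two details are assumptions not stated in this section: the shared MLP is realized with 1 × 1 convolutions, and the reduction ratio of 16 is the value commonly used in the original CBAM paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ChannelAttention(nn.Module):
    """Channel attention branch of CBAM, following Equation (1)."""

    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        # Shared MLP: two 1x1 convolutions with a ReLU in between; the same
        # weights process both the avg-pooled and max-pooled descriptors.
        self.mlp = nn.Sequential(
            nn.Conv2d(channels, channels // reduction, 1, bias=False),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1, bias=False),
        )
        self.sigmoid = nn.Sigmoid()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Global average and max pooling yield two C x 1 x 1 descriptors.
        avg = self.mlp(F.adaptive_avg_pool2d(x, 1))
        mx = self.mlp(F.adaptive_max_pool2d(x, 1))
        # Sum the two branches and squash to (0, 1): M_c(F).
        return self.sigmoid(avg + mx)
```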
After the channel attention module is executed, the spatial attention mechanism provides additional focus on key spatial locations. Initially, max pooling and average pooling operations are applied to the input feature map $F$, this time along the channel dimension, resulting in two two-dimensional feature maps $F_{max}^{s}$ and $F_{avg}^{s}$. Subsequently, these two feature maps are stacked along the channel dimension and passed through a 7 × 7 convolutional layer, followed by the application of the Sigmoid function, to obtain the spatial attention map $M_{s}(F)$, as shown in Equation (2):

$$M_{s}(F) = \sigma\left(f^{7\times 7}\left(\left[\mathrm{AvgPool}(F);\, \mathrm{MaxPool}(F)\right]\right)\right) \tag{2}$$

In the equation:
$\sigma$—represents the Sigmoid activation function;
$f^{7\times 7}$—signifies a convolution layer with a 7 × 7 kernel size;
$[\,\cdot\,;\,\cdot\,]$—denotes the stacking of feature maps along the channel dimension.
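The spatial attention branch of Equation (2) admits a similarly compact sketch. The 7 × 7 kernel follows the text; padding of 3 is assumed so that the attention map keeps the input’s spatial size.

```python
import torch
import torch.nn as nn

class SpatialAttention(nn.Module):
    """Spatial attention branch of CBAM, following Equation (2)."""

    def __init__(self, kernel_size: int = 7):
        super().__init__()
        # 7x7 convolution over the two stacked pooled maps; padding
        # preserves the H x W resolution of the input.
        self.conv = nn.Conv2d(2, 1, kernel_size,
                              padding=kernel_size // 2, bias=False)
        self.sigmoid = nn.Sigmoid()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Channel-wise average and max pooling give two 1 x H x W maps.
        avg = torch.mean(x, dim=1, keepdim=True)
        mx, _ = torch.max(x, dim=1, keepdim=True)
        # Stack along the channel dimension, convolve, squash: M_s(F).
        return self.sigmoid(self.conv(torch.cat([avg, mx], dim=1)))
```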
The final output of the CBAM module is obtained by first adjusting the input feature map $F$ with the channel attention weights $M_{c}(F)$, and then further refining the result with the spatial attention map $M_{s}(F')$, as shown in Equations (3) and (4):

$$F' = M_{c}(F) \otimes F \tag{3}$$

$$F'' = M_{s}(F') \otimes F' \tag{4}$$

In the equations:
$\otimes$—denotes element-wise multiplication.
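Putting the two branches together, the serial composition of Equations (3) and (4) can be sketched as follows, reusing the ChannelAttention and SpatialAttention classes from the sketches above:

```python
import torch.nn as nn

class CBAM(nn.Module):
    """Serial CBAM: spatial attention operates on a feature map that has
    already been re-weighted by channel attention (Equations (3)-(4))."""

    def __init__(self, channels: int):
        super().__init__()
        self.ca = ChannelAttention(channels)
        self.sa = SpatialAttention()

    def forward(self, x):
        x = self.ca(x) * x       # F'  = M_c(F)  (x) F
        return self.sa(x) * x    # F'' = M_s(F') (x) F'
```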
The CBAM attention mechanism fuses channel and spatial attention serially, which means that the weights of the second attention module are influenced by the output of the first, regardless of whether channel attention or spatial attention is applied first. Specifically, the preceding module adjusts the input feature map to a certain extent, thereby affecting the weight allocation of the subsequent module. While this design performs well in many applications, its inherent sequential dependency limits the model’s ability to capture features comprehensively, since each type of attention is computed from a feature map that has already been “adjusted” by the other.
To overcome this limitation and further enhance the proposed model’s performance, an improved CBAM structure is proposed in this paper in which the original serial fusion of attention is replaced with parallel fusion. In this improved structure, the channel attention and spatial attention modules independently process the original input feature map at the same time rather than sequentially, so each attention module directly adjusts the original feature map rather than one already modified by the other. This parallel processing eliminates the sequential dependency between the attention modules, allowing the model to capture the information in the feature map more flexibly and comprehensively. The channel attention weights $M_{c}(F)$ and the spatial attention weights $M_{s}(F)$ are applied simultaneously to the input feature map $F$. The improved formula is shown in Equation (5):

$$F'' = M_{c}(F) \otimes M_{s}(F) \otimes F \tag{5}$$
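Under the same assumptions as the sketches above, the parallel fusion of Equation (5) changes only the forward pass: both attention maps are computed from the untouched input and applied jointly.

```python
import torch.nn as nn

class ParaCBAM(nn.Module):
    """Parallel-fusion CBAM (Equation (5)): both attention branches read
    the original input F, removing the serial dependency. A sketch that
    reuses the branch modules defined above."""

    def __init__(self, channels: int):
        super().__init__()
        self.ca = ChannelAttention(channels)
        self.sa = SpatialAttention()

    def forward(self, x):
        # Both branches see the same, unmodified input in parallel:
        # F'' = M_c(F) (x) M_s(F) (x) F
        return self.ca(x) * self.sa(x) * x
```

Since the channel weights have shape (B, C, 1, 1) and the spatial weights (B, 1, H, W), the two broadcast over the input without imposing any order of application, which is precisely the order-independence the parallel design targets.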
Through this parallel fusion design, the model no longer depends on the processing order of channel and spatial attention, enabling more efficient and effective use of both types of attention to enhance feature representation. This improvement not only holds promise for better performance across various visual tasks, but also offers a new perspective on the design of attention mechanisms. In practice, the parallel-fusion CBAM can be readily integrated into existing convolutional neural networks, opening up new possibilities for enhancing a network’s expressiveness and adaptability. The improved CBAM attention module in this paper is named Para_CBAM, and its overall structure is depicted in
Figure 3.
2.2. Optimizing the Feature Fusion Network
During object detection, large objects span many pixels, so their feature points are rarely lost in convolution operations, whereas small objects with fewer pixels lose feature points more easily as the network deepens. To further improve the proposed model’s ability to recognize complex scenes and subtle defects, optimizing the feature fusion network is crucial. Although the original YOLOv5 utilizes FPN + PANet to enhance feature fusion efficiency in multiple aspects, there is still room for improvement when dealing with finer targets and more complex backgrounds. BiFPN, with its efficient bidirectional fusion paths and cross-scale connections, not only enhances the detection of objects at different scales, but also improves the model’s ability to capture details while maintaining computational efficiency, which significantly benefits aluminum profile defect detection. Therefore, BiFPN was adopted as the core feature fusion network in this study.
The network structure of BiFPN is depicted in
Figure 4. Through carefully designed bidirectional fusion paths and cross-scale connections, the BiFPN network architecture achieves efficient and flexible feature fusion. Initially, it receives multi-scale feature maps from the base convolutional network, which exhibit varying resolutions and represent different semantic levels from shallow to deep layers. In the top-down path, BiFPN gradually conveys rich semantic information from higher layers to lower-level feature maps through upsampling and weighted fusion operations, thus enhancing their semantic representation capability. Simultaneously, in the bottom-up path, it transfers detailed information from lower layers to higher-level feature maps through downsampling and fusion operations, thus improving their spatial detail representation. Additionally, BiFPN establishes cross-scale connections to directly transmit information between the feature maps of different scales, significantly enhancing the efficiency of information flow. At feature fusion points, BiFPN employs learned weights for weighted fusion, adaptively adjusting the contribution of features from different sources to generate multi-scale feature maps fused with rich contextual information. These feature maps, processed by BiFPN, not only contain abundant semantic information, but also retain fine spatial details, thus providing robust feature support for downstream tasks such as object detection or semantic segmentation.
During feature fusion, BiFPN employs learnable weights to perform a weighted summation of features from different sources. If there are multiple feature layers $P_{1}, P_{2}, \ldots, P_{n}$ to be fused, each feature layer $P_{i}$ is associated with a corresponding weight $w_{i}$. The fused feature $P_{out}$ can be computed using Formula (6):

$$P_{out} = \sum_{i=1}^{n} \frac{w_{i}}{\varepsilon + \sum_{j=1}^{n} w_{j}} \cdot P_{i} \tag{6}$$

where each $w_{i} \geq 0$ is enforced by a ReLU and $\varepsilon$ is a small constant that keeps the normalization numerically stable, so the contribution of each source is adaptively and stably weighted.
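A minimal sketch of this fusion node follows, assuming the fast normalized fusion used by BiFPN (ReLU-constrained weights with ε = 10⁻⁴) and that the incoming feature maps have already been resized and projected to a common shape by the surrounding up- and downsampling paths. This node is the building block repeated at every fusion point of the top-down and bottom-up paths.

```python
import torch
import torch.nn as nn

class WeightedFusion(nn.Module):
    """Learnable weighted fusion of n feature layers (Formula (6))."""

    def __init__(self, num_inputs: int, eps: float = 1e-4):
        super().__init__()
        # One learnable scalar weight per incoming feature layer P_i.
        self.w = nn.Parameter(torch.ones(num_inputs))
        self.eps = eps

    def forward(self, feats):
        # ReLU keeps each w_i non-negative; dividing by their sum plus a
        # small epsilon normalizes the contributions as in Formula (6).
        w = torch.relu(self.w)
        w = w / (w.sum() + self.eps)
        return sum(wi * f for wi, f in zip(w, feats))
```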
2.3. Improved Target Detection Layer
In YOLOv5, three detection layers are utilized to detect objects of different sizes. These layers are positioned at different levels of the network to predict targets within various scale ranges. In the original YOLOv5 model, the core of the network output consists of three detectors that perform grid-based, anchor-based detection on feature maps of varying sizes. After feature fusion, the detection layers output feature maps at three scales: 80 × 80, 40 × 40, and 20 × 20 (for a 640 × 640 input). In detail-sensitive applications such as aluminum profile defect detection, the model must be capable of identifying and locating defects of various sizes, including very small ones.
The 160 × 160 feature map provides the model with finer spatial information, which is crucial for detecting small or subtle defects. In applications like aluminum profile defect detection, defects often occupy only a small portion of the image and have inconspicuous characteristics; high-resolution feature maps carry rich detail, enabling the model to identify and locate small defects with higher precision. This feature map size offers an appropriate balance: it preserves detailed information at sufficient resolution without becoming so large that the model is difficult to handle, allowing detection accuracy and robustness to improve while real-time performance is maintained. Therefore, the addition of a non-fused 160 × 160 high-resolution feature map enriches the spatial information available to the model, helping it better understand image details and improving the accuracy of small-target detection.
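As a quick check of these scales, assuming YOLOv5’s common 640 × 640 input resolution, the standard strides of 8, 16, and 32 plus the added stride-4 branch yield exactly the four grid sizes discussed above:

```python
# Detection-layer grid sizes for a 640 x 640 input. Stride 4 is the
# added high-resolution branch for small defects; 8/16/32 are the
# standard YOLOv5 detection strides.
input_size = 640
for stride in (4, 8, 16, 32):
    grid = input_size // stride
    print(f"stride {stride:2d} -> {grid} x {grid} feature map")
# stride  4 -> 160 x 160   (new layer)
# stride  8 ->  80 x 80
# stride 16 ->  40 x 40
# stride 32 ->  20 x 20
```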