For bolt object detection, this paper initially used the four YOLOv5 network models (n–x) for detection, but neither detection accuracy nor speed was satisfactory; subsequently, the YOLOv8 model was used, which yielded results more in line with expectations. To achieve higher accuracy and faster detection of underframe bolts, the YOLOv8s network structure was improved. The specific improvements are as follows:
3.1. SFC2f Module
The C2f module is a crucial component in the YOLOv8 algorithm, achieving richness and accuracy in feature extraction through two convolutional layers and feature fusion. It consists of convolutional layers, feature fusion, activation functions, and normalization layers. Its workflow includes initial convolution, intermediate feature map splitting, second convolution, and feature map concatenation. The C2f module enhances feature richness and computational efficiency and is designed with modularity for easy integration. However, for underframe bolt detection, its complexity, high memory usage, and parameter redundancy increase the difficulty of training and inference. Therefore, by altering the module structure, reducing memory usage, and improving parameter utilization, its performance in underframe bolt detection can be further enhanced.
Faster_Block [15] is a new module structure designed for object detection, which performs feature extraction and transformation through a series of convolutional layers that help the network learn more discriminative features. Although the Bottleneck module in the C2f structure of YOLOv8 has a similar function, it performs less efficiently. Their specific structures are shown in Figure 2.
The Faster_Block module replaces the 3 × 3 convolution with a 1 × 1 convolution while incorporating the design of a 3 × 3 partial convolution (PConv). PConv applies conventional convolution to only a portion of the input channels for spatial feature extraction, thereby reducing memory access and computational redundancy, which improves computational speed. The memory access of conventional convolution and partial convolution can be expressed by Equations (1) and (2):
\( h \times w \times 2c + k^{2} \times c^{2} \approx h \times w \times 2c \)  (1)

\( h \times w \times 2c_{p} + k^{2} \times c_{p}^{2} \approx h \times w \times 2c_{p} \)  (2)

where \( h \) and \( w \) are the height and width of the feature map, \( k \) is the size of the convolution kernel, \( c \) is the number of channels for conventional convolution, and \( c_{p} \) is the number of channels for partial convolution. In practical applications, \( c_{p} = c/4 \), so the memory access of PConv is only 1/4 that of conventional convolution, with the remaining \( (c - c_{p}) \) channels not involved in computation, thus requiring no memory access.
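To make the channel-slicing idea concrete, the following is a minimal PyTorch sketch of a PConv layer under the assumptions above (\( c_{p} = c/4 \), a 3 × 3 kernel); the class name and the slicing strategy are illustrative rather than the exact FasterNet implementation:

```python
import torch
import torch.nn as nn

class PConv(nn.Module):
    """Partial convolution: a 3x3 conv applied to only the first
    c_p = c/4 channels. A minimal sketch based on the description
    above; the exact FasterNet implementation may differ."""
    def __init__(self, channels: int, ratio: float = 0.25, kernel_size: int = 3):
        super().__init__()
        self.cp = int(channels * ratio)  # channels that actually get convolved
        self.conv = nn.Conv2d(self.cp, self.cp, kernel_size,
                              padding=kernel_size // 2, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Convolve the first c_p channels; the remaining (c - c_p) channels
        # pass through untouched, so they incur no extra memory access.
        x1, x2 = x[:, :self.cp], x[:, self.cp:]
        return torch.cat((self.conv(x1), x2), dim=1)
```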
As discussed above, replacing the Bottleneck module in the C2f structure with the Faster_Block module reduces memory access and improves target detection speed. In this paper, we replaced the Bottleneck in the C2f structure of the YOLOv8s model with a Faster_Block to create the FC2f module, optimizing feature extraction while minimizing the computational overhead. The structural comparison of the two modules is shown in Figure 3.
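A hedged sketch of how a Faster_Block could be assembled from the PConv layer above is shown below; the two point-wise convolutions, the expansion ratio, and the residual connection follow the general FasterNet design, but the exact configuration used in this paper may differ:

```python
class FasterBlock(nn.Module):
    """Sketch of a Faster_Block: PConv for spatial mixing, then two
    1x1 (point-wise) convs for channel mixing, with a residual path.
    Replacing the C2f Bottleneck with this block yields FC2f."""
    def __init__(self, channels: int, expansion: int = 2):
        super().__init__()
        hidden = channels * expansion
        self.pconv = PConv(channels)  # partial 3x3 conv from the sketch above
        self.pw1 = nn.Conv2d(channels, hidden, 1, bias=False)
        self.bn = nn.BatchNorm2d(hidden)
        self.act = nn.ReLU(inplace=True)
        self.pw2 = nn.Conv2d(hidden, channels, 1, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.pw2(self.act(self.bn(self.pw1(self.pconv(x)))))
```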
Secondly, the Spatial and Channel Reconstruction Convolution (ScConv) proposed by Li et al. [16] was integrated into the FC2f module to design the SFC2f module. ScConv captures spatial relationships at different positions through localized convolution operations and adjusts channel weights to reconstruct information, thereby better capturing correlations across positions and channels, further suppressing redundant information, and improving detection capability. At the same time, the SFC2f module adopts lightweight operations and feature reconstruction methods, reducing computational complexity and making model training and inference faster. The ScConv and SFC2f structures are shown in Figure 4.
ScConv mainly consists of two structures: the Spatial Reconstruction Unit (SRU) and the Channel Reconstruction Unit (CRU). The SRU enhances the network’s representation capabilities through separation and reconstruction operations, processing the input feature map through the following steps:
The separation process uses Group Normalization (GN) to evaluate the information content in different feature maps. GN scales the input features based on the mean \( \mu \) and standard deviation \( \sigma \), followed by control through threshold gating. This process can be mathematically represented as:

\( X_{out} = GN(X) = \gamma \frac{X - \mu}{\sqrt{\sigma^{2} + \epsilon}} + \beta \)  (3)

where \( \gamma \) and \( \beta \) are learnable parameters, optimized during the training process. These parameters are typically initialized randomly and adjusted through backpropagation: the loss function is minimized using a gradient-based optimizer, such as SGD (stochastic gradient descent) or Adam, and as the network iteratively processes the training data, \( \gamma \) and \( \beta \) are updated at each step based on the gradients of the loss with respect to these parameters, allowing them to learn the values that best capture the relationships between the input features and the target output. \( \epsilon \) is a small constant used for numerical stability. The information evaluation of the feature map is shown in Equations (4) and (5):

\( W_{\gamma} = \{ w_{i} \} = \frac{\gamma_{i}}{\sum_{j=1}^{C} \gamma_{j}}, \quad i, j = 1, 2, \ldots, C \)  (4)

\( W = Gate\left( Sigmoid\left( W_{\gamma}(GN(X)) \right) \right) \)  (5)
After the separation operation, the feature maps are divided into \( X_{1}^{w} \) with greater information and \( X_{2}^{w} \) with less redundant information. The reconstruction process combines these features to reduce spatial redundancy and enhance relevant features. The reconstruction process is mathematically represented as:

\( X^{w1} = X_{1}^{w1} \oplus X_{2}^{w2} \)  (6)

\( X^{w2} = X_{1}^{w2} \oplus X_{2}^{w1} \)  (7)

\( X^{w} = X^{w1} \cup X^{w2} \)  (8)

where \( \oplus \) denotes element-wise addition, \( \cup \) denotes concatenation, and \( X_{1}^{w1}, X_{1}^{w2} \) and \( X_{2}^{w1}, X_{2}^{w2} \) are the channel-wise halves of \( X_{1}^{w} \) and \( X_{2}^{w} \), respectively.
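The following PyTorch sketch illustrates the SRU’s separate-and-reconstruct flow under Equations (3)–(8); the 0.5 gating threshold and the group count are assumptions for illustration, not the authors’ exact settings:

```python
class SRU(nn.Module):
    """Sketch of the Spatial Reconstruction Unit: GN-based information
    weighting, threshold gating, and cross-reconstruction."""
    def __init__(self, channels: int, groups: int = 4, threshold: float = 0.5):
        super().__init__()
        self.gn = nn.GroupNorm(groups, channels)
        self.threshold = threshold  # assumed gating threshold

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        gn_x = self.gn(x)
        # Eq. (4): normalized gamma as per-channel information weights.
        w_gamma = (self.gn.weight / self.gn.weight.sum()).view(1, -1, 1, 1)
        # Eq. (5): sigmoid re-weighting followed by threshold gating.
        w = torch.sigmoid(gn_x * w_gamma)
        w1 = torch.where(w > self.threshold, torch.ones_like(w), w)   # informative
        w2 = torch.where(w > self.threshold, torch.zeros_like(w), w)  # redundant
        x1, x2 = w1 * x, w2 * x
        # Eqs. (6)-(8): split each part along channels, cross-add, concatenate.
        x11, x12 = torch.chunk(x1, 2, dim=1)
        x21, x22 = torch.chunk(x2, 2, dim=1)
        return torch.cat((x11 + x22, x12 + x21), dim=1)
```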
The CRU aims to optimize redundant features in the channel dimension through lightweight convolution operations, enhancing representative features while maintaining computational efficiency. The CRU adopts a split-transform-fuse strategy, detailed as follows:
The spatially refined features \( X^{w} \) generated by the SRU are split into two parts, with channel numbers \( \alpha C \) and \( (1 - \alpha) C \), where \( \alpha \) is the split ratio (\( 0 \leq \alpha \leq 1 \)). Both parts are compressed using a 1 × 1 convolution kernel, resulting in \( X_{up} \) and \( X_{low} \). This process can be mathematically described as:

\( X_{up} = Conv_{1 \times 1}(X^{w}_{\alpha C}) \)  (9)

\( X_{low} = Conv_{1 \times 1}(X^{w}_{(1 - \alpha) C}) \)  (10)
The compressed features \( X_{up} \) and \( X_{low} \) are subjected to group-wise convolution (GWC) and point-wise convolution (PWC), respectively. The output results are then combined to form the transformed features \( Y_{1} \) and \( Y_{2} \). Mathematically, the transformation process is expressed as:

\( Y_{1} = M_{G} X_{up} \)  (11)

\( Y_{2} = M_{P} X_{low} \)  (12)

where \( M_{G} \) and \( M_{P} \) are the learnable matrices of GWC and PWC, respectively.
The final step uses a simplified SKNet method to adaptively merge \( Y_{1} \) and \( Y_{2} \). Global average pooling is used to obtain the pooled features \( S_{1} \) and \( S_{2} \), followed by a Softmax operation to generate the feature weight vectors \( \beta_{1} \) and \( \beta_{2} \). Finally, under the guidance of the weights, \( Y_{1} \) and \( Y_{2} \) are merged along the channel direction to obtain the refined channel features \( Y \). This process is represented as:

\( S_{m} = \frac{1}{H \times W} \sum_{i=1}^{H} \sum_{j=1}^{W} Y_{m}(i, j), \quad m = 1, 2 \)  (13)

\( \beta_{1} = \frac{e^{S_{1}}}{e^{S_{1}} + e^{S_{2}}}, \quad \beta_{2} = \frac{e^{S_{2}}}{e^{S_{1}} + e^{S_{2}}}, \quad Y = \beta_{1} Y_{1} + \beta_{2} Y_{2} \)  (14)
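A compact sketch of the CRU’s split-transform-fuse strategy, covering Equations (9)–(14), is given below; \( \alpha = 0.5 \), the squeeze ratio, and the GWC group count are illustrative assumptions:

```python
class CRU(nn.Module):
    """Sketch of the Channel Reconstruction Unit (split-transform-fuse)."""
    def __init__(self, channels: int, alpha: float = 0.5,
                 squeeze: int = 2, groups: int = 2):
        super().__init__()
        self.c_up = int(alpha * channels)
        self.c_low = channels - self.c_up
        up_sq, low_sq = self.c_up // squeeze, self.c_low // squeeze
        self.squeeze_up = nn.Conv2d(self.c_up, up_sq, 1, bias=False)    # Eq. (9)
        self.squeeze_low = nn.Conv2d(self.c_low, low_sq, 1, bias=False) # Eq. (10)
        self.gwc = nn.Conv2d(up_sq, channels, 3, padding=1,
                             groups=groups, bias=False)                 # Eq. (11)
        self.pwc = nn.Conv2d(low_sq, channels, 1, bias=False)           # Eq. (12)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x_up, x_low = torch.split(x, [self.c_up, self.c_low], dim=1)
        y1 = self.gwc(self.squeeze_up(x_up))
        y2 = self.pwc(self.squeeze_low(x_low))
        # Eqs. (13)-(14): global average pooling + softmax to fuse Y1 and Y2.
        s = torch.stack((y1.mean(dim=(2, 3)), y2.mean(dim=(2, 3))), dim=0)
        beta = torch.softmax(s, dim=0)
        b1 = beta[0].unsqueeze(-1).unsqueeze(-1)
        b2 = beta[1].unsqueeze(-1).unsqueeze(-1)
        return b1 * y1 + b2 * y2
```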
Therefore, replacing the Bottleneck module in the C2f module with the Faster_Block module reduces memory access and improves target detection speed. At the same time, introducing ScConv into the FC2f module not only enhances the model’s computational efficiency and feature extraction capability but also effectively suppresses redundant information, making the designed SFC2f module perform better in target detection tasks than the traditional C2f module. These improvements provide a new pathway for enhancing model performance and lay the foundation for further optimization and application.
3.2. Coordinate Attention Mechanism
During neural network training, attention mechanisms are often added to optimize the network model [17]. They work by autonomously learning to reduce the learning weights of less important parts of the input data while enhancing the weights of more important parts. This paper adopts the coordinate attention (CA) mechanism [18,19], which can accurately capture the positional information of the target within the overall image data. The specific network model structure is shown in Figure 5 [20].
The X Avg Pool and Y Avg Pool function as coordinate information embedding mechanisms, addressing the issue of global pooling’s inability to preserve positional information, thereby enabling the attention module to capture spatial long-range dependencies with precise positional information. Specifically, a pooling kernel of size (H, 1) is used to encode each channel along the x-axis, followed by a pooling kernel of size (1, W) to encode each channel along the y-axis. These two transformations aggregate features along two spatial directions, returning a pair of direction-aware attention maps that provide accurate positional information for regions of interest [21]. This process can be expressed by Equations (15) and (16):

\( z_{c}^{h}(h) = \frac{1}{W} \sum_{0 \leq i < W} x_{c}(h, i) \)  (15)

\( z_{c}^{w}(w) = \frac{1}{H} \sum_{0 \leq j < H} x_{c}(j, w) \)  (16)
Coordinate attention generation involves concatenating the feature maps generated in the two directions, followed by a shared 1 × 1 convolution \( F_{1} \) to transform the feature maps as expressed in Equation (17), and splitting the generated feature map into two separate tensors along the spatial dimension. Then, two 1 × 1 convolutional kernels, \( F_{h} \) and \( F_{w} \), are used to transform the two tensors to the same number of channels as the input, as shown in Equations (18) and (19). Finally, they are expanded to obtain the form expressed in Equation (20), which serves as the final output attention weights [22].

\( f = \delta(F_{1}([z^{h}, z^{w}])) \)  (17)

\( g^{h} = \sigma(F_{h}(f^{h})) \)  (18)

\( g^{w} = \sigma(F_{w}(f^{w})) \)  (19)

\( y_{c}(i, j) = x_{c}(i, j) \times g_{c}^{h}(i) \times g_{c}^{w}(j) \)  (20)

Here, \( f \) is the intermediate feature map of spatial information in the horizontal and vertical directions, \( \delta \) is a non-linear activation function, and \( \sigma \) is the sigmoid function, while \( f^{h} \) and \( f^{w} \) are the two separate tensors obtained by splitting \( f \) along the spatial dimension. \( g^{h} \) and \( g^{w} \) are the feature maps with the same number of channels as the input, obtained through feature transformation from \( f^{h} \) and \( f^{w} \), respectively. Finally, \( y \) represents the attention-weighted output.
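For reference, a minimal PyTorch sketch of a coordinate attention block implementing Equations (15)–(20) follows; the channel reduction ratio is an assumption, and ReLU stands in for the non-linear activation \( \delta \):

```python
class CoordAtt(nn.Module):
    """Sketch of coordinate attention following Equations (15)-(20)."""
    def __init__(self, channels: int, reduction: int = 32):
        super().__init__()
        mid = max(8, channels // reduction)
        self.f1 = nn.Conv2d(channels, mid, 1)  # shared 1x1 conv, Eq. (17)
        self.bn = nn.BatchNorm2d(mid)
        self.act = nn.ReLU(inplace=True)
        self.fh = nn.Conv2d(mid, channels, 1)  # Eq. (18)
        self.fw = nn.Conv2d(mid, channels, 1)  # Eq. (19)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        # Eqs. (15)-(16): direction-wise average pooling along x and y.
        zh = x.mean(dim=3, keepdim=True)                      # (B, C, H, 1)
        zw = x.mean(dim=2, keepdim=True).permute(0, 1, 3, 2)  # (B, C, W, 1)
        # Eq. (17): concatenate along the spatial axis, shared transform.
        f = self.act(self.bn(self.f1(torch.cat((zh, zw), dim=2))))
        fh, fw = torch.split(f, [h, w], dim=2)
        gh = torch.sigmoid(self.fh(fh))                       # (B, C, H, 1)
        gw = torch.sigmoid(self.fw(fw)).permute(0, 1, 3, 2)   # (B, C, 1, W)
        # Eq. (20): re-weight the input with the two attention maps.
        return x * gh * gw
```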
3.3. MPDIoU Loss Function Improvement
YOLOv8 employs the CIoU loss function and adds DFL loss to enable the network to more quickly focus on the target location and the distribution of neighboring areas. This approach enhances the model’s generalization in object detection under complex conditions. However, because the aspect ratio of the detection box is a relative value, it can cause the predicted box to increase in size, affecting the precise localization of some small bolt targets. To address this issue, this paper improves the loss function by adopting the MPDIoU loss function.
MPDIoU can more accurately measure the matching degree of bounding boxes, fully utilizing the geometric features of horizontal rectangles, and combines factors such as overlapping or non-overlapping regions, center point distance, and width and height deviations to comprehensively compare the similarity between the predicted bounding box and the ground truth bounding box.
As shown in Figure 6, the blue box represents the ground truth box, the red box represents the initial predicted box, and \( d_{1} \) and \( d_{2} \) are the distances between the top-left and bottom-right corners of the ground truth and predicted boxes, respectively. By iteratively reducing \( d_{1} \) and \( d_{2} \), the predicted box gradually approaches the ground truth box, ultimately achieving better detection results. The specific expressions of the MPDIoU loss function are shown in Equations (21)–(23):

\( d_{1}^{2} = (x_{1}^{prd} - x_{1}^{gt})^{2} + (y_{1}^{prd} - y_{1}^{gt})^{2} \)  (21)

\( d_{2}^{2} = (x_{2}^{prd} - x_{2}^{gt})^{2} + (y_{2}^{prd} - y_{2}^{gt})^{2} \)  (22)

\( L_{MPDIoU} = 1 - MPDIoU = 1 - IoU + \frac{d_{1}^{2}}{w^{2} + h^{2}} + \frac{d_{2}^{2}}{w^{2} + h^{2}} \)  (23)

where \( (x_{1}^{gt}, y_{1}^{gt}) \) and \( (x_{2}^{gt}, y_{2}^{gt}) \) denote the top-left and bottom-right corners of the ground truth box, \( (x_{1}^{prd}, y_{1}^{prd}) \) and \( (x_{2}^{prd}, y_{2}^{prd}) \) those of the predicted box, and \( w \) and \( h \) are the width and height of the input image.
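A hedged sketch of the MPDIoU loss in Equations (21)–(23) for corner-format boxes might look as follows; the function name and the eps stabilizer are illustrative:

```python
import torch

def mpdiou_loss(pred: torch.Tensor, target: torch.Tensor,
                img_w: int, img_h: int, eps: float = 1e-7) -> torch.Tensor:
    """Sketch of the MPDIoU loss; boxes are (x1, y1, x2, y2)."""
    # Intersection over union.
    ix1 = torch.max(pred[..., 0], target[..., 0])
    iy1 = torch.max(pred[..., 1], target[..., 1])
    ix2 = torch.min(pred[..., 2], target[..., 2])
    iy2 = torch.min(pred[..., 3], target[..., 3])
    inter = (ix2 - ix1).clamp(0) * (iy2 - iy1).clamp(0)
    area_p = (pred[..., 2] - pred[..., 0]) * (pred[..., 3] - pred[..., 1])
    area_t = (target[..., 2] - target[..., 0]) * (target[..., 3] - target[..., 1])
    iou = inter / (area_p + area_t - inter + eps)
    # Eqs. (21)-(22): squared top-left and bottom-right corner distances.
    d1 = (pred[..., 0] - target[..., 0]) ** 2 + (pred[..., 1] - target[..., 1]) ** 2
    d2 = (pred[..., 2] - target[..., 2]) ** 2 + (pred[..., 3] - target[..., 3]) ** 2
    # Eq. (23): penalize corner distances normalized by the image diagonal.
    diag = img_w ** 2 + img_h ** 2
    return 1.0 - (iou - d1 / diag - d2 / diag)
```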
The core improvement of the MPDIoU loss function lies in its comprehensive assessment of the matching degree between bounding boxes. It considers the intersection over union (IoU) of the predicted and ground truth boxes, measuring their overlapping area. Additionally, by calculating the distance between the center points of the predicted and ground truth boxes, it ensures precise positional matching. Considering the differences in width and height between the predicted and ground truth boxes further enhances the accuracy of matching. With these improvements, MPDIoU performs better in handling small bolt targets, allowing the computation process to converge faster and making detection more efficient, while addressing the issue of identical loss values for the same aspect ratio.
3.4. LAMP Score-Based Pruning Algorithm
LAMP (Layer-Adaptive Magnitude-based Pruning) is a pruning algorithm used for deep neural networks. It selects and removes redundant weights in the network, reducing the model’s computational load and storage requirements while maintaining performance as much as possible [23]. The specific scoring formula is shown in Equation (24):

\( score(u; W) = \frac{W[u]^{2}}{\sum_{v \geq u} W[v]^{2}} \)  (24)
In the equation, \( W \) represents the weight tensor, \( W[u] \) and \( W[v] \) represent the weight terms mapped by indices \( u \) and \( v \), respectively, with \( u \) and \( v \) corresponding to the indices of the weights sorted in ascending order of magnitude. The LAMP score measures the relative importance of a target connection among all “surviving” connections in the same layer, and the weights with the lowest LAMP scores in each layer are pruned until the pruning requirement is met. This pruning method ensures that at least one weight is retained in each layer, and the retained weights are the most important ones in each layer. The pruning method is shown in Figure 7.
The main process of LAMP score pruning is as follows: (1) Obtain the weight file trained with YOLOv8s and perform initialization; (2) Calculate the square of the magnitude of the connection weights and normalize it to the sum of the squared magnitudes of all “surviving weights” in the same layer to obtain the LAMP score; (3) Based on the LAMP scores, select the connections with lower scores and prune the corresponding number of connections according to the pre-set global sparsity requirement; (4) Remove the selected connections from the model by setting their weights to zero; (5) Retrain the pruned model to recover any potential performance loss during pruning; and (6) Evaluate the performance of the pruned model to ensure that the pruning operation has not significantly reduced the model’s accuracy.
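A minimal sketch of steps (2)–(4), computing LAMP scores per Equation (24) and deriving binary pruning masks for a global sparsity target, is given below; the mask-based formulation is illustrative, and a real pipeline would typically use a pruning library rather than raw masks:

```python
import torch

def lamp_prune_masks(weights: list[torch.Tensor],
                     sparsity: float) -> list[torch.Tensor]:
    """Per layer: sort weights ascending by magnitude and normalize each
    squared magnitude by the sum over all weights at or above it
    ("surviving" weights), then prune the globally lowest-scoring
    fraction. Returns one 0/1 keep-mask per layer."""
    scores = []
    for w in weights:
        flat = w.flatten().abs()
        order = torch.argsort(flat)  # ascending by magnitude
        sq = flat[order] ** 2
        # Denominator of Eq. (24): suffix sums = sum over v >= u of W[v]^2.
        denom = torch.flip(torch.cumsum(torch.flip(sq, [0]), 0), [0])
        layer_scores = torch.empty_like(flat)
        layer_scores[order] = sq / denom
        scores.append(layer_scores.view_as(w))
    # Global threshold over all layers (pre-set sparsity requirement).
    all_scores = torch.cat([s.flatten() for s in scores])
    k = max(int(sparsity * all_scores.numel()), 1)
    threshold = torch.kthvalue(all_scores, k).values
    return [(s > threshold).float() for s in scores]  # 1 = keep, 0 = prune
```

Because the largest weight in each layer always scores 1, a global threshold below 1 keeps at least one weight per layer, matching the guarantee described above.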
Through the LAMP score pruning algorithm, SFCA-YOLOv8s can effectively remove redundant weights in the network, thereby reducing the model’s computational load. Although the LAMP pruning algorithm removes some weights, its adaptive selection mechanism ensures that the retained weights are the ones that contribute the most to the model’s performance. Therefore, after introducing the LAMP pruning algorithm, SFCA-YOLOv8s can further optimize while maintaining its original accuracy.