1. Introduction
The cotter pin is a critical stabilizing component in power transmission lines, playing a vital role in maintaining the structural integrity of the transmission system. Given the complex and harsh operational environment of power transmission lines, cotter pins are exposed to prolonged stress, making them susceptible to loosening and detachment. This risk threatens the safety and stability of the transmission system, making timely detection of cotter pin defects particularly important.
However, detecting cotter pins in power transmission lines presents unique challenges compared to general object detection.
Figure 1 shows a typical inspection image containing two cotter pins, with their positions marked by red bounding boxes. In regular inspection images, cotter pins are usually not the focal point, making them extremely small and difficult to observe with the naked eye. The original image dimensions are 2000 × 1125 pixels, while the annotated regions for the two cotter pins measure only 80 × 50 pixels and 75 × 75 pixels, occupying approximately 0.18% and 0.25% of the total image area, respectively, both well below the established threshold for small-object detection (typically <1% of image area). The primary challenge in cotter pin defect detection is therefore the small target size.
Figure 2a–c shows close-up slices of the regular, loose, and missing cotter pin forms, respectively. The differences between these three forms are quite subtle. In inspection images taken from different angles and distances, the difference between Figure 2a and Figure 2c (representing different cotter pin forms) is slight, whereas the difference between examples 1 and 2 in Figure 2a (representing the same cotter pin form) is quite significant. This fine-grained problem of “small inter-class differences and large intra-class differences” is another challenge in cotter pin defect detection. Additional factors include the diversity of inspection environments, the complexity of power facilities, changes in weather and lighting conditions, interference from human and natural factors, and limitations of image capture devices. As a result, the backgrounds of power inspection images are generally complex, further increasing the difficulty of detecting missing cotter pin defects.
In traditional power line inspections, the images captured by drones contain a large number of cotter pins, which makes detection costly, inefficient, and of limited accuracy. Inspectors must manually examine each cotter pin one by one, which consumes considerable time and effort, increases operational and maintenance costs, and introduces the risk of missed or false detections due to human error. Therefore, integrating drone image acquisition with deep learning holds critical significance for industrial inspections [1]. By leveraging deep learning techniques, we can develop models that automatically analyze image or video data from power transmission lines, quickly and accurately detecting defective cotter pins. This not only reduces the workload of manual inspections and lowers operational costs, but also improves detection accuracy, ensuring the safety and stability of power transmission.
Research on cotter pin defect detection is relatively scarce and incomplete, both domestically and internationally. Early detection of power transmission line fittings and their defects relied primarily on manually designed features, resulting in low detection efficiency and accuracy. With the development of deep learning, which excels at feature extraction, processing, and localization in image detection, deep learning has become the mainstream approach for cotter pin defect detection. Based on the detection process, detection algorithms can be divided into single-stage and multi-stage cascade detection algorithms.
(1) The single-stage detection algorithm, as illustrated in the process of Figure 3, performs object localization and classification directly on the image, outputting the bounding box coordinates and class probabilities of the objects to produce the detection results. Gong et al. [2] proposed a deep learning model based on an improved RetinaNet, named DDNet, for detecting defects in small cotter pins in electric power transmission systems from UAV images. This method enhances feature extraction capabilities by introducing ResNeSt50 as the backbone network and combining the feature pyramid network (FPN) with the receptive field block (RFB) to improve the detection accuracy of small targets. Experimental results demonstrate that the model performs well in the task of cotter pin defect detection. Based on the YOLOv5 algorithm, Yang et al. [3] optimized the network structure and loss function, significantly improving the speed and accuracy of cotter pin defect detection, thereby providing effective technical support for intelligent inspection robots. However, since single-stage object detection extracts target features directly from the raw image, it is susceptible to interference from background noise, resulting in lower detection accuracy.
(2) Multi-stage cascade detection algorithms typically consist of two stages, as shown in the multi-stage object detection flowchart in Figure 3. First, a region proposal network generates candidate regions containing the target. Then, a classification and regression network performs precise localization and classification within these candidate regions. This approach is generally considered to provide higher accuracy. Li et al. [4] introduced a lightweight detection method, CSSAdet (Cross-Scale Spatial Attention Detector), which combines spatial and cross-scale attention mechanisms. The method first detects component connection points in images captured by drones and then identifies the status of the cotter pins at these points. Li et al. [5] proposed a two-level cascade detector, where the first level locates the cotter pin in the image and crops a region around it, which is then sent to a second-level detector for fine-grained status recognition. Fang et al. [6] developed a cascade network for detecting missing cotter pins and anti-vibration hammers. The network first identifies the target region and then extracts more robust cotter pin features, improving detection accuracy. Experimental results show that this two-level network improves significantly over non-cascade networks, especially in accuracy for missing cotter pins. However, given the specific challenges of cotter pin defect detection, such as small target sizes and fine-grained differences, these algorithms remain susceptible to interference from background information. Important cotter pin features may be overlooked during the candidate region generation phase, reducing the effectiveness of the two-stage design; accordingly, this approach did not perform well in our subsequent experiments. Additionally, due to the layered nature of the multi-stage process, the computational complexity is high, reducing detection speed.
In summary, current cotter pin defect detection algorithms mainly follow single-stage and multi-stage cascade designs. Single-stage detection algorithms offer high detection speed but relatively low detection accuracy. Multi-stage methods face challenges in cotter pin defect detection because the cotter pin occupies a tiny portion of the image, most of which is background, and other power tower components are similar in color and shape to the cotter pin; each stage of the multi-stage model is thus susceptible to interference, resulting in lower detection accuracy and poor applicability. Therefore, improving algorithms for cotter pin defect detection is of great significance for addressing the cotter pin’s small size. Mainstream solutions primarily revolve around technical approaches such as multi-scale feature fusion or global context modeling. Feature pyramid-based methods (e.g., FPN [7]) enhance the representation of tiny objects through cross-layer feature aggregation, while transformer-based models leverage self-attention mechanisms to capture long-range dependencies, yet suffer from high computational complexity and sensitivity to local details [8]. As shown in Figure 3, this paper performs sub-image segmentation on the dataset to increase the relative size of target regions (bounding boxes) within the image, thereby reducing interference from excessive background information in cotter pin defect detection. In terms of the algorithmic model, this paper adds detection heads that focus on shallow feature information, fuses multi-scale feature information, and employs attention mechanisms to prioritize important detail features. It also improves the loss function to better accommodate the characteristics of small-object detection, thereby enhancing detection performance for cotter pins.
The main contributions of this paper are as follows:
- (1)
P-C2f module. This paper combines the C2f module with the PSA (polarized self-attention) module to design a new module named P-C2f. Based on the C2f module, the P-C2f module integrates vertically polarized self-attention, enabling high-quality regression performance and reducing the impact of fine-grained issues in cotter pin defect detection on the model’s performance.
- (2)
Introduction of the MCA module before the detection head. This paper introduces the MCA (multidimensional collaborative attention) module before the model’s detection head. This module allows the network bottleneck layer to output image feature information and calculate the weights of different channels for the original feature map, enhancing the network’s attention to small targets of the cotter pin.
- (3)
Improvement of the loss function. In this paper, the boundary box loss function of YOLOv8 is replaced with a more appropriate WIOU (wise-IOU) loss function to reduce the impact of sample quality fluctuations on the model.
Based on the three key modules introduced in YOLOv8, we have named our model PMW-YOLOv8.
This paper is divided into five sections. The first section describes the background of research, current state, and cotter pin defect detection challenges. The second section discusses related works on cotter pin detection, summarizing the achievements and shortcomings of these works. The third section introduces the structure of the proposed PMW-YOLOv8 framework and the improved modules. The fourth section provides a detailed analysis of the experimental results, demonstrating the effectiveness and superiority of the proposed model. The fifth section summarizes the work in this paper and discusses future research directions.
3. PMW-YOLOv8 Network Model Structure
YOLOv8 achieves a balance between real-time performance and detection accuracy through its Cross-Stage Partial Network (CSPNet) and dynamic detection head design, which aligns with the dual requirements of rapid response and precise recognition in power inspection scenarios. Experimental results (as shown in Table 7) demonstrate that this model improves mean average precision (mAP) by 3.2% compared to YOLOv5, YOLOv7, and RT-DETR on our custom dataset. Furthermore, even the enhanced frameworks targeting the latest versions (PMW-YOLOv10 and PMW-YOLOv11) failed to surpass the performance of PMW-YOLOv8, further validating YOLOv8 as the optimal baseline.
In response to the challenges posed by the large number of cotter pins, small target sizes, and the fine-grained feature differences between different categories of targets, which make detection difficult, this paper improves upon the YOLOv8 architecture and proposes a new PMW-YOLOv8 model.
The structure of the PMW-YOLOv8 model is illustrated in Figure 4. The proposed PMW-YOLOv8 feature extraction and detection process is as follows: First, the input power inspection images are preprocessed by resizing them to 640 × 640 × 3, ensuring uniform size at the input to the neural network. After preprocessing, the images are passed into the backbone network, where the CBS layer, C2f layer, and SPPF layer work together to extract image features; as the backbone deepens, progressively deeper image features are extracted. Subsequently, the extracted features from different layers are concatenated in the neck network to achieve feature fusion, enhancing the model’s detection performance for targets of various sizes. Finally, the fused features are enhanced through the MCA module [49] and P-C2f module, then sent to the decoupled detection heads (detect) for box detection and classification (cls) recognition. Each detection head produces feature maps for both classification and regression scales. The detection boxes are obtained after the feature maps from the four detection heads are concatenated; the boxes are then filtered and mapped back to the original image to produce the final detection results.
As shown in Figure 4, to improve the model’s ability to detect small targets, the improved model adds a small-target detection head, P2, to the YOLOv8 structure. It receives larger, shallower feature maps, allowing the model to leverage shallow image features in which the contours of small targets are preserved, thereby improving detection accuracy for small targets. Additionally, we designed a new P-C2f module to replace the original C2f module. The P-C2f module incorporates a PSA (polarized self-attention) mechanism [50], effectively preventing information loss while selectively enhancing and fine-tuning feature information. This enables the model to tackle complex tasks, such as cotter pin defect detection, with improved accuracy. In addition, we incorporated the MCA module into the added small-target detection head (P2), feeding the feature information extracted by the P-C2f module into the MCA module. This module fully captures the correlations between different dimensions, calibrates the attention weights generated across dimensions, and extracts more refined features, improving detection and classification accuracy for small cotter pins.
3.1. A Small Target Detection Layer
YOLOv8 is designed with three layers of feature maps, utilizing three detection heads (P3–P5) of different sizes for detecting targets of various sizes. However, due to the small size of the cotter pin targets on power transmission lines, the deep features obtained through continuous downsampling contain fewer features related to these small targets, which reduces the algorithm’s detection accuracy for the cotter pin targets.
Therefore, in PMW-YOLOv8, based on the original YOLOv8 model architecture, we added a new small-target detection head (P2) and the corresponding detection layer, as shown in the blue-shaded section of Figure 4. After the original model’s C2f module outputs a feature map of size 80 × 80 × 256, an upsampling operation enlarges the feature map, producing an output of size 160 × 160 × 128. This feature map is fused with shallow features from the backbone network through a skip connection, generating a larger feature map containing richer small-target feature information. Compared with the original three detection layers of YOLOv8, the newly added small-target detection layer fuses the shallower raw features of the image with the features extracted and enhanced by the network’s deeper layers. The resulting features thus contain more shallow detail information, significantly improving detection accuracy for small targets such as cotter pins and their defects and partially compensating for the model’s performance shortcomings in small-target detection.
3.2. P-C2f: Integrating PSA into C2f
Due to the diversity of cotter pin targets in inspection images, they exhibit high complexity in size, shape, position, and angle. This leads to significant feature similarities among the three states: regular, loose, and missing. For example, Figure 2a,c belong to different categories but display high feature similarity, whereas the upper and lower subfigures in Figure 2a belong to the same category but exhibit low feature similarity. Such situations reduce the model’s classification accuracy across categories.
Although the C2f feature fusion module in YOLOv8 enables the model to capture contextual information and high-resolution details simultaneously, it treats features of different scales and levels equally, maintaining relatively similar attention across these features. In cotter pin defect detection, most features appear at a small scale and exhibit fine-grained characteristics, which requires strong emphasis on shallow features to capture the target’s fine-grained details. Therefore, this paper introduces the polarized self-attention (PSA) mechanism [46] into the C2f module and proposes a new P-C2f module.
The computation processes of the channel and spatial self-attention branches are represented in Equations (1) and (2):

$$A^{ch}(X) = F_{SG}\left[W_{z}\left(\sigma_1\left(W_v(X)\right) \times F_{SM}\left(\sigma_2\left(W_q(X)\right)\right)\right)\right] \quad (1)$$

$$A^{sp}(X) = F_{SG}\left[\sigma_3\left(F_{SM}\left(\sigma_1\left(F_{GP}\left(W_q(X)\right)\right)\right) \times \sigma_2\left(W_v(X)\right)\right)\right] \quad (2)$$

Here, $W_q$, $W_v$, and $W_z$ are standard 1 × 1 convolutions (Conv), with $\theta$ denoting their internal parameters. $\sigma_1$ and $\sigma_2$ are matrix reshaping operators, which transform tensors of shape 1 × H × W and C/2 × H × W into matrices of shape HW × 1 × 1 and C/2 × 1, respectively; $\sigma_3$ likewise reshapes the result back into a 1 × H × W spatial map. $F_{SG}$ and $F_{SM}$ are the Sigmoid and Softmax operators, $F_{GP}$ is the global pooling operator, and “×” represents the dot product operation of matrices. The two modules are arranged in series to form a complete PSA module. The operation of the complete PSA module is shown in Equation (3), where the feature map processed by the channel attention branch is used as input to the spatial attention branch:

$$PSA(X) = A^{sp}\left(Z^{ch}\right) \odot Z^{ch}, \quad Z^{ch} = A^{ch}(X) \odot X \quad (3)$$
In this paper, by combining PSA with C2f, a new P-C2f module is proposed to replace part of the C2f modules in YOLOv8, enhancing the model’s feature representation ability and improving detection accuracy. The structure of the P-C2f module is shown in Figure 5. C2f is a two-branch structure: after generating an intermediate feature map of size H × W × C through the CBS (Conv2d + BatchNorm2d + SiLU) module, the feature map is split into two parts of size H × W × C/2 using Split. One branch outputs directly to the final Concat block, while the other passes through multiple Bottleneck blocks for further processing before being sent to the Concat block for feature fusion. The Bottleneck module processes features layer by layer to extract deeper image features. By adding PSA to the Bottleneck, PSA participates in this entire layer-by-layer processing. The Softmax function in PSA transforms the input feature values into a probability distribution, highlighting larger feature values and suppressing smaller ones (less important channel features). It adjusts the weights of each channel, enabling the model to automatically focus on the most relevant channel and spatial information and highlighting key features in the global context, so that important features dominate subsequent processing; this was validated in our subsequent experiments.
The expression for PSA-Bottleneck is given by Equation (4), where $X$ is the input to PSA-Bottleneck, $Y$ is its output, and $f_{3\times3}$ is a 3 × 3 convolution:

$$Y = X + PSA\left(f_{3\times3}\left(f_{3\times3}(X)\right)\right) \quad (4)$$

The expression for P-C2f, with PSA-Bottleneck replacing Bottleneck, is given by Equation (5):

$$Y_{out} = f_{1\times1}\left(\mathrm{Concat}\left(X_1,\ \Psi_n(X_2)\right)\right), \quad \left(X_1, X_2\right) = \mathrm{Split}\left(f_{1\times1}\left(X_{in}\right)\right) \quad (5)$$

where $Y_{out}$ is the output of P-C2f, $X_{in}$ is the input to C2f, $f_{1\times1}$ is a 1 × 1 convolution, $X_1$ is the part of the features that, after splitting, is passed directly to the Concat block, and $\Psi_n$ represents the stacking of n PSA-Bottleneck blocks.
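To make the structure concrete, the following is a minimal PyTorch sketch of PSA, PSA-Bottleneck, and P-C2f as described by Equations (1)–(5); the class names, channel choices, and exact placement of activations are our illustrative assumptions rather than the released implementation.

```python
# Minimal sketch of P-C2f: PSA (channel + spatial branches in series), a PSA-augmented
# Bottleneck per Equation (4), and the split/stack/concat topology per Equation (5).
import torch
import torch.nn as nn
import torch.nn.functional as F

class PSA(nn.Module):
    """Polarized self-attention: channel branch, then spatial branch (Equations (1)-(3))."""
    def __init__(self, c):
        super().__init__()
        self.ch_q = nn.Conv2d(c, 1, 1)        # channel branch W_q: C -> 1
        self.ch_v = nn.Conv2d(c, c // 2, 1)   # channel branch W_v: C -> C/2
        self.ch_z = nn.Conv2d(c // 2, c, 1)   # channel branch W_z: C/2 -> C
        self.sp_q = nn.Conv2d(c, c // 2, 1)   # spatial branch W_q
        self.sp_v = nn.Conv2d(c, c // 2, 1)   # spatial branch W_v

    def forward(self, x):
        b, c, h, w = x.shape
        # Channel-only attention, Equation (1): Softmax over spatial positions.
        q = F.softmax(self.ch_q(x).view(b, 1, h * w), dim=-1)
        v = self.ch_v(x).view(b, c // 2, h * w)
        z = torch.matmul(v, q.transpose(1, 2)).view(b, c // 2, 1, 1)
        x = x * torch.sigmoid(self.ch_z(z))                 # F_SG gate, broadcast over HxW
        # Spatial-only attention, Equation (2): global pooling + Softmax over channels.
        q = F.softmax(F.adaptive_avg_pool2d(self.sp_q(x), 1).view(b, 1, c // 2), dim=-1)
        v = self.sp_v(x).view(b, c // 2, h * w)
        attn = torch.sigmoid(torch.matmul(q, v).view(b, 1, h, w))
        return x * attn                                     # series composition, Equation (3)

class PSABottleneck(nn.Module):
    """Bottleneck (two 3x3 convs + shortcut) with PSA inserted, per Equation (4)."""
    def __init__(self, c):
        super().__init__()
        self.cv1 = nn.Conv2d(c, c, 3, padding=1)
        self.cv2 = nn.Conv2d(c, c, 3, padding=1)
        self.psa = PSA(c)

    def forward(self, x):
        return x + self.psa(self.cv2(F.silu(self.cv1(x))))

class PC2f(nn.Module):
    """P-C2f, per Equation (5): split, n stacked PSA-Bottlenecks, concat, 1x1 fuse."""
    def __init__(self, c_in, c_out, n=2):
        super().__init__()
        self.cv1 = nn.Conv2d(c_in, c_out, 1)                    # entry f_1x1
        self.blocks = nn.ModuleList(PSABottleneck(c_out // 2) for _ in range(n))
        self.cv2 = nn.Conv2d((2 + n) * (c_out // 2), c_out, 1)  # fuse f_1x1

    def forward(self, x):
        y = list(self.cv1(x).chunk(2, dim=1))   # Split into X1 (passed through) and X2
        for block in self.blocks:
            y.append(block(y[-1]))              # stack the n PSA-Bottlenecks
        return self.cv2(torch.cat(y, dim=1))    # Concat all branches, then fuse

x = torch.randn(1, 64, 160, 160)
print(PC2f(64, 64)(x).shape)  # torch.Size([1, 64, 160, 160])
```

The sketch keeps C2f’s split-stack-concat topology and only swaps the Bottleneck for its PSA-augmented variant, which is the essence of P-C2f.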
The PSA mechanism adopts a dual-branch polarized design (Equations (1) and (2)). In the channel dimension, it achieves a “winner-takes-all” weight distribution through Softmax, while preserving multi-region responses via Sigmoid in the spatial dimension. This characteristic simultaneously enhances critical channels and spatial positions, effectively addressing the fine-grained problem of “large intra-class variations and small inter-class differences” in cotter pin detection. Compared with traditional attention mechanisms like SE or CBAM, PSA reduces computational complexity from O(C²) to O(C) through matrix reshaping operations ($\sigma_1$, $\sigma_2$), making it particularly suitable for YOLO architectures requiring multiple detection heads. Experimental results demonstrate that the P-C2f module increases parameters by only 3% while improving model performance. Heatmap visualizations reveal that gradient distributions in the PSA branch are more concentrated on geometric edges of cotter pins, whereas the original YOLOv8 model with pure C2f modules exhibits dispersed gradient patterns. This evidence confirms that PSA strengthens gradient backpropagation in discriminative regions through polarized weighting.
3.3. Embedding MCA into the Model
To further enhance the model’s feature representation ability, as well as its detection accuracy and efficiency for small targets, this paper adds an MCA module before each detection head in the YOLOv8 model, at positions 1, 2, 3, and 4 shown in Figure 4. The features processed by the backbone network are fed into the MCA module, further enhancing the image features, thereby improving the detection and classification ability of the detection heads and reducing the model’s false detection rate.
The structure of MCA [49] is shown in Figure 6, adopting a three-branch architecture, with each branch responsible for modeling attention along the channel, width, and height dimensions. Taking the first branch as an example, its entire process can be summarized as Equation (6):

$$X'_W = PM_W^{-1}\left(\sigma\left(F_{tr}\left(F_{sq}\left(PM_W(X)\right)\right)\right) \otimes PM_W(X)\right) \quad (6)$$

Here, $X$ represents the input feature to the MCA, which is sent to each branch. In the first branch, $PM_W$ indicates a 90° rotation of the feature matrix along the H direction (permute), and $PM_W^{-1}$ indicates the corresponding 90° clockwise rotation back along the H direction. $F_{sq}$ represents the squeeze transformation combining average pooling and standard-deviation pooling, $F_{tr}$ is the activation transformation with a 1 × k convolution kernel, $\sigma$ is the Sigmoid function, and $\otimes$ represents matrix multiplication. The same operation is applied in the other spatial dimension, along the height direction (H), to obtain $X'_H$. The channel branch (C), identical to the previous two branches except that it omits the two rotation operations, yields $X'_C$. Finally, the outputs from the three branches are averaged to obtain the feature map optimized with weights across the different dimensions.
PMW-YOLOv8 integrates the MCA module before each detection head to enhance the feature map transmitted to the detection head and subsequent layers along both the channel and spatial dimensions. This enhancement mechanism strengthens the representation of small-target features, ensuring that these fine-grained details remain pronounced during subsequent processing. Specifically, we integrate the MCA module after the P-C2f and C2f modules. Before the P-C2f and C2f modules, the Concat module receives the feature map from the previous layer together with the shallow feature map arriving via the skip connection; these two feature maps are concatenated to fuse deep and shallow features. The resulting fused feature map is then passed into the P-C2f or C2f module for further feature extraction.
On this basis, we integrate the MCA module to enhance the feature weights of the target region in the feature maps output by the C2f module. The mechanism of MCA in PMW-YOLOv8 is described by Equation (7), where the output of MCA is represented as $Y$ and its input, the output of C2f or P-C2f, as $X$:

$$Y = \mathrm{MCA}(X) = \frac{1}{3}\left(X'_W + X'_H + X'_C\right) \quad (7)$$

Small targets are typically concentrated in specific channels in the channel dimension, while background channels contribute less to the feature map. The channel branch applies convolution and pooling operations to the output of either P-C2f or C2f, effectively compressing the channel features. It adjusts the channel weights using the Sigmoid function to increase attention on target-related channels while reducing the weights of irrelevant channels. The adjusted weights are then added to the original feature map via residual connections, focusing on the channel features that contribute most to small targets, ensuring their prominence is preserved in deeper layers and ultimately improving the accuracy of small-target detection. The same process applies to the spatial dimensions.
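For clarity, a minimal PyTorch sketch of MCA consistent with Equations (6) and (7) follows; the simple summation of the two pooled descriptors, the kernel size k = 3, and all names are illustrative assumptions standing in for MCA’s adaptive combination.

```python
# Sketch of one MCA branch (rotate, squeeze with avg + std pooling, 1xk conv excitation,
# Sigmoid gate, rotate back) and the three-branch average per Equation (7).
import torch
import torch.nn as nn

class MCABranch(nn.Module):  # hypothetical name
    def __init__(self, k=3):
        super().__init__()
        self.conv = nn.Conv1d(1, 1, k, padding=k // 2, bias=False)  # F_tr: 1 x k kernel

    def forward(self, x, dims=None):
        if dims:                                   # PM: rotation (permute) exposing the target dim
            x = x.transpose(*dims)
        b, c, h, w = x.shape
        avg = x.mean(dim=(2, 3))                   # F_sq: average pooling    -> (B, C)
        std = x.std(dim=(2, 3))                    # F_sq: std-dev pooling    -> (B, C)
        s = (avg + std).unsqueeze(1)               # combined descriptor      -> (B, 1, C)
        gate = torch.sigmoid(self.conv(s)).view(b, c, 1, 1)  # excitation + Sigmoid
        x = x * gate                               # reweight along the rotated dimension
        if dims:                                   # PM^-1: rotate back
            x = x.transpose(*dims)
        return x

class MCA(nn.Module):
    """Width, height, and channel branches; outputs averaged, per Equation (7)."""
    def __init__(self, k=3):
        super().__init__()
        self.bw, self.bh, self.bc = MCABranch(k), MCABranch(k), MCABranch(k)

    def forward(self, x):
        xw = self.bw(x, dims=(1, 3))   # attend along width
        xh = self.bh(x, dims=(1, 2))   # attend along height
        xc = self.bc(x)                # channel branch: no rotation
        return (xw + xh + xc) / 3.0

print(MCA()(torch.randn(1, 128, 160, 160)).shape)  # torch.Size([1, 128, 160, 160])
```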
The integrated enhanced features follow two processing paths: on the one hand, the enhanced features are directly passed to the corresponding detection head for classification and localization prediction of feature maps with different resolutions; on the other hand, these features are downsampled further, fused with deeper layer features, and then passed to deeper detection heads for target detection. This optimization mechanism enables the model to more effectively highlight target region features while suppressing irrelevant background interference, especially demonstrating advantages in small-target detection tasks. In subsequent experiments, we confirmed the performance improvement of the model by integrating MCA through ablation experiments. Additionally, we conducted comparative experiments, validating that the integration of MCA in this model outperforms other attention mechanisms for this task. Finally, we compared different MCA integration strategies and demonstrated the validity of the MCA integration approach proposed in this paper.
3.4. WIOU Loss
The loss function of YOLOv8 is divided into classification and regression branches. The classification loss function is the BCE loss function, which helps the model correctly classify the detected objects. Its computation is represented by Equation (8), where $y$ is the actual label of the sample and $p$ is the model’s predicted probability:

$$L_{BCE} = -\left[y \log p + (1 - y)\log(1 - p)\right] \quad (8)$$

The regression loss combines CIOU and DFL (distribution focal loss), which helps the model locate the detected objects. The DFL loss function is calculated as in Equation (9):

$$L_{DFL}\left(S_i, S_{i+1}\right) = -\left[\left(y_{i+1} - y\right)\log S_i + \left(y - y_i\right)\log S_{i+1}\right] \quad (9)$$

where $S_i$ and $S_{i+1}$ are the model’s predicted probabilities for the two discrete bins adjacent to the target, and $y_i$, $y_{i+1}$, and $y$ represent the nearby label value below the target, the nearby label value above it, and the true label value, respectively.
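As an illustrative example, a true label value $y = 4.3$ lying between the adjacent bins $y_i = 4$ and $y_{i+1} = 5$ weights the two log-probabilities by $y_{i+1} - y = 0.7$ and $y - y_i = 0.3$, so the predicted distribution is encouraged to concentrate on the two bins nearest the true value.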
In our training dataset, some low-quality examples inevitably occur, such as cases where the IOU between anchor boxes and target boxes is low or where the resolution of the target and background is low. The geometric factors in the CIOU loss used by YOLOv8, such as distance and aspect ratio, increase the penalty on these low-quality samples while requiring significant computational resources. Therefore, we replaced the CIOU loss function with the WIOU loss function [51], which reduces attention and interference when the detection box already aligns well with the target box, mitigating the penalty from geometric factors. This reduces the impact of low-quality samples on training, thereby improving the model’s generalization ability. The loss function of the improved model is shown in Equation (10):
$$L = a\,L_{WIOU} + b\,L_{BCE} + c\,L_{DFL} \quad (10)$$

In Equation (10), the three loss functions are weighted by proportional weights, where the hyperparameters a, b, and c represent the weights of the three loss functions. In this paper, a, b, and c are set to 7.5, 0.5, and 1.5, respectively. The WIOU loss function is shown in Equations (11) and (12):

$$L_{WIOU} = r \cdot R_{WIOU} \cdot L_{IOU} \quad (11)$$

$$R_{WIOU} = \exp\left(\frac{\left(x - x_{gt}\right)^2 + \left(y - y_{gt}\right)^2}{\left(W_g^2 + H_g^2\right)^{*}}\right) \quad (12)$$

Parameters $(x, y)$ and $(x_{gt}, y_{gt})$ represent the center points of the detection box and the ground truth box, respectively, while $W_g$ and $H_g$ are the width and height of the union region (the smallest box enclosing both boxes); the superscript * indicates that the term is detached from the computational graph. When the detection box fits well with the ground truth box, $R_{WIOU}$ becomes smaller, significantly reducing the focus on detection boxes with good matching performance. Conversely, when the detection box fits the ground truth box poorly, as happens with hard-to-classify or hard-to-predict samples during training, the gradient gain $r$ becomes smaller. This encourages the model to focus on ordinary samples during training, reducing the emphasis on both the highest- and lowest-quality samples, thereby enhancing the model’s generalization ability. The specific expression of $r$ is given in Equation (13).
In Equation (13), $\alpha$ and δ are hyperparameters; in this paper we adopt the settings recommended for WIOU, $\alpha = 1.9$ and $\delta = 3$:

$$r = \frac{\beta}{\delta\,\alpha^{\beta - \delta}} \quad (13)$$

$\beta$ is a parameter expressing the degree of matching between two boxes, which is inversely proportional to the degree of match between the boxes. The specific expression of $\beta$ is given in Equation (14):

$$\beta = \frac{L_{IOU}^{*}}{\overline{L_{IOU}}} \in [0, +\infty) \quad (14)$$

$\overline{L_{IOU}}$ is the moving average of the IOU loss, which dynamically adjusts the gradient gain based on the current degree of matching between the two boxes and formulates the gradient gain distribution strategy best suited to the current situation. This keeps the gradient descent speed relatively level, accelerating the training process and improving detection accuracy.
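As a reference, a minimal sketch of the WIOU computation in Equations (11)–(14) is given below; the function name and the externally maintained running mean `iou_mean` are illustrative assumptions.

```python
# Sketch of the WIOU loss (Equations (11)-(14)) for axis-aligned boxes (x1, y1, x2, y2).
import torch

def wiou_loss(pred, target, iou_mean, alpha=1.9, delta=3.0):
    # IOU and L_IOU
    lt = torch.max(pred[:, :2], target[:, :2])
    rb = torch.min(pred[:, 2:], target[:, 2:])
    wh = (rb - lt).clamp(min=0)
    inter = wh[:, 0] * wh[:, 1]
    area_p = (pred[:, 2] - pred[:, 0]) * (pred[:, 3] - pred[:, 1])
    area_t = (target[:, 2] - target[:, 0]) * (target[:, 3] - target[:, 1])
    iou = inter / (area_p + area_t - inter + 1e-7)
    l_iou = 1.0 - iou
    # R_WIOU, Equation (12): center distance over the enclosing-box diagonal (detached).
    cx_p, cy_p = (pred[:, 0] + pred[:, 2]) / 2, (pred[:, 1] + pred[:, 3]) / 2
    cx_t, cy_t = (target[:, 0] + target[:, 2]) / 2, (target[:, 1] + target[:, 3]) / 2
    enc = torch.max(pred[:, 2:], target[:, 2:]) - torch.min(pred[:, :2], target[:, :2])
    wg, hg = enc[:, 0], enc[:, 1]
    r_wiou = torch.exp(((cx_p - cx_t) ** 2 + (cy_p - cy_t) ** 2)
                       / (wg ** 2 + hg ** 2).detach())
    # beta (Eq. (14)) and the non-monotonic gradient gain r (Eq. (13));
    # iou_mean is a running mean of L_IOU maintained by the training loop.
    beta = l_iou.detach() / iou_mean
    r = beta / (delta * alpha ** (beta - delta))
    return r * r_wiou * l_iou  # Equation (11)
```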
4. Analysis of Experimental Results
4.1. Experimental Environment
The experimental hardware configuration includes a 12 vCPU Intel(R) Xeon(R) Silver 4214R CPU @ 2.40 GHz with 32 GB RAM (Intel Corporation, Santa Clara, CA, USA) and an NVIDIA GeForce RTX 3080 Ti 12 GB graphics card (NVIDIA Corporation, Santa Clara, CA, USA). Training and testing were conducted using the PyTorch 2.4.1 + cu118 neural network framework. Furthermore, each model configuration (including the baseline YOLOv8 and PMW-YOLOv8) was independently executed 10 times in this environment with fixed hyperparameters (imgsz = 640, batch = 8, lr0 = 0.01, lrf = 0.001), and the reported results are the arithmetic means of these repeated runs.
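A sketch of this training protocol using the Ultralytics API is shown below; the model scale, dataset configuration file name, and validation handling are placeholders, not settings taken from the paper.

```python
# Sketch of the repeated-runs protocol with the stated fixed hyperparameters.
from ultralytics import YOLO

for run in range(10):                  # 10 independent runs; reported results are averaged
    model = YOLO("yolov8s.yaml")       # placeholder scale; PMW-YOLOv8 would use a custom YAML
    model.train(
        data="cotter_pin.yaml",        # hypothetical dataset config
        imgsz=640,                     # input size, as stated above
        batch=8,
        lr0=0.01,                      # initial learning rate
        lrf=0.001,                     # final learning rate factor
    )
    metrics = model.val()              # evaluate each run on the validation split
```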
4.2. Evaluation Metrics
To assess the performance of the proposed PMW-YOLOv8 improved model, this paper uses three key evaluation metrics: precision (P), recall (R), and mean average precision at a 0.5 IOU threshold (mAP0.5). These metrics reflect different aspects of the model’s detection capabilities and complement each other, creating a comprehensive evaluation system for the model’s overall performance.
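For reference, these metrics follow the standard definitions, where TP, FP, and FN denote true positives, false positives, and false negatives, and N is the number of classes:

$$P = \frac{TP}{TP + FP}, \qquad R = \frac{TP}{TP + FN}$$

$$AP = \int_0^1 P(R)\,dR, \qquad mAP_{0.5} = \frac{1}{N}\sum_{i=1}^{N} AP_i$$

where each $AP_i$ is computed from the precision-recall curve at an IOU threshold of 0.5.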
4.3. Datasets
The dataset of this study originates from the 2023 transmission line refined inspection project of a power supply bureau in Yunnan Province. The original dataset contains 23,371 drone-captured transmission line images covering critical power equipment such as insulators, fittings, conductors, and vibration dampers. To ensure data quality, we selected images containing cotter pins and their defects (loosening and missing) from the original dataset, excluding blurred samples or those with ambiguous defect categories. A total of 4218 valid images were retained. The annotations were performed in YOLO format using the Python v3.8.0-based open-source tool LabelImg v1.8.6 under the guidance of professional power engineers. The annotated targets include three categories of cotter pin defects: normal, loosening, and missing. Due to the small proportion of loose and missing cotter pins in regular transmission lines, the sample distribution of defective cotter pins is extremely imbalanced compared to normal ones. To address this issue, we applied data augmentation techniques such as flipping, tilting, and adding noise to the images containing defective cotter pins. The final dataset contains 9108 images, which are split into training, validation, and test sets at a ratio of 6:2:2.
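A sketch of such augmentation using torchvision is given below; the specific transforms and magnitudes are assumptions, and in practice the YOLO-format box coordinates must be transformed together with the image (not shown).

```python
# Illustrative augmentation for defective-cotter-pin images: flip, tilt, additive noise.
import torch
from torchvision import transforms

augment = transforms.Compose([
    transforms.RandomHorizontalFlip(p=0.5),   # flipping
    transforms.RandomRotation(degrees=15),    # tilting (assumed magnitude)
    transforms.ToTensor(),
    # Additive Gaussian noise, clamped back to the valid pixel range.
    transforms.Lambda(lambda t: (t + 0.02 * torch.randn_like(t)).clamp(0, 1)),
])
```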
The cotter pin is already small, and the resolution of standard power inspection images is often high because most electrical components must be recorded in detail. This further reduces the relative size of the cotter pin in the image, increasing the difficulty of precise defect detection. Inspired by SAHI [52], we adopted specific image-processing techniques to overcome this challenge during model training. Specifically, we divided the high-resolution inspection images into several smaller sub-images. As shown in Figure 7, we set the cropped sub-image size to 640 × 640 pixels to fit the model’s input, with a 20% overlap between adjacent crops. As shown in Figure 8a, the relative size of all target boxes is less than 20% of the image area, with the majority below 5%; the 20% overlap therefore ensures that targets are not truncated by the cropping. In addition, we excluded sub-images containing no targets from the training set to improve training efficiency.
The number of sub-images generated from each image by the cropping operation is given by Equation (15):

$$n_w = \left\lfloor \frac{W - o_w\,w}{\left(1 - o_w\right)w} \right\rfloor, \qquad n_h = \left\lfloor \frac{H - o_h\,h}{\left(1 - o_h\right)h} \right\rfloor \quad (15)$$

Here, $n_w$ and $n_h$ represent the number of sub-images in the horizontal and vertical directions of the image, respectively. $W$ and $H$ are the width and height of the image in pixels, while $w$ and $h$ denote the pixel width and height of the sub-images. $o_w$ and $o_h$ are the overlap ratios of the sub-images in the horizontal and vertical directions, respectively, and $\lfloor\cdot\rfloor$ denotes the floor function.
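A minimal sketch of this slicing scheme follows; unlike Equation (15), it shifts the final tile so the image edge is always covered, which is a common variant, and all names are illustrative.

```python
# Sketch of overlapping 640x640 tiling with 20% overlap, as described above.
def slice_image(img_w, img_h, tile_w=640, tile_h=640, ow=0.2, oh=0.2):
    """Return the top-left corners of overlapping tiles covering the image."""
    stride_w = int(tile_w * (1 - ow))   # 512 px horizontal stride at 20% overlap
    stride_h = int(tile_h * (1 - oh))
    xs = list(range(0, max(img_w - tile_w, 0) + 1, stride_w))
    ys = list(range(0, max(img_h - tile_h, 0) + 1, stride_h))
    if xs[-1] + tile_w < img_w:         # shift the last column tile to the image edge
        xs.append(img_w - tile_w)
    if ys[-1] + tile_h < img_h:         # shift the last row tile to the image edge
        ys.append(img_h - tile_h)
    return [(x, y) for y in ys for x in xs]

tiles = slice_image(2000, 1125)         # the example image size from Section 1
print(len(tiles), tiles[:3])            # 8 tiles for this image
```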
After the image slicing, the numbers of sub-images in the training, validation, and test sets are 12,977, 4217, and 4203, respectively.
Figure 8b provides a visual analysis of the cotter pin target sizes in the sliced dataset. A comparison shows that, although small targets remain the majority in the dataset, their relative size has increased significantly after slicing, which benefits cotter pin defect detection.
We conducted training and testing on the dataset before and after slicing, and the results are shown in Table 1. All performance metrics showed significant improvements, with precision and recall increasing by 6.6% and 14.3%, respectively. The AP for the three categories increased by 15.1%, 5.7%, and 16.4%, respectively, indicating significant improvements across categories. Meanwhile, the overall mAP0.5 metric increased by 13.6%, further confirming the general effectiveness of the slicing method. Over a broader evaluation range, the mAP0.5:0.95 metric increased by 10.9%. This result demonstrates the effectiveness of the slicing method in improving model performance and further proves the stability and reliability of this approach under different thresholds.
4.4. Ablation Study
4.4.1. Ablation Study on Model Improvements
We designed a series of ablation experiments to further investigate the effectiveness of the proposed model improvements and the independent contribution of each module to the model’s performance. These experiments gradually integrated the P2 detection head, P-C2f module, MCA module, and the WIOU loss function, followed by quantitative evaluation and comparative analysis of the model’s performance. In all experiments, we used a unified training strategy and consistent hyperparameter settings to ensure the academic rigor and reproducibility of the results. The experimental results (shown in Table 2) demonstrate the positive impact of each improvement on the model’s performance.
After integrating the small-object detection head (P2), the model’s precision increased by 1.3%, and the AP for detecting cotter pin looseness and cotter pin missing defects improved by 2.8% and 0.2%, respectively. The mAP0.5 increased by 0.5%. After introducing the P-C2f module, the model’s mAP0.5 metric significantly improved by 1.14 percentage points. Remarkably, for detecting cotter pin looseness and missing defects, the AP improved by 1.2% and 2.5%, respectively. This result strongly supports the effectiveness of the polarization self-attention mechanism in enhancing the model’s feature representation ability.
Subsequently, after further integrating the MCA module based on the introduction of the P-C2f module, the model’s performance received another significant improvement. The mAP0.5 metric increased by 0.8 percentage points. In particular, for the detection tasks of cotter pin looseness and missing defects, precision increased by 1.8% and 0.6%, respectively. This result indicates that introducing the MCA module enhanced the model’s ability to effectively handle complex scenarios and capture multi-dimensional contextual information.
Finally, when the WIOU loss function was used during training, the model’s performance reached its optimal level. The mAP0.5 metric reached 66.3%, representing an improvement of three percentage points compared to the baseline network. Additionally, other key performance metrics also showed significant improvements. This result not only validates the effectiveness of the WIOU loss function in optimizing the model training process, but also comprehensively demonstrates the significant impact of the proposed model improvement strategy on enhancing overall model performance.
The ablation study data, shown in Table 2, systematically validate the optimization of safety-critical metrics. The final model (OURS) achieved 5.9% and 4.7% AP gains for loosening and missing defects, respectively, demonstrating enhanced control over high-risk false negatives (FN). While the overall recall increased modestly from 58.1% to 58.8%, precision improved by 2.0%, confirming no false-positive (FP) trade-off. Key module contributions include the following: WIOU loss alone boosted loosening AP by 4.2%; MCA further increased missing AP by 1.8%; and P2 + P-C2f integration improved missing AP by 4.7% over the baseline. These results, strictly derived from experimental data, validate the method’s reliability in power safety-critical scenarios.
After introducing the P2 small-object detection head to the baseline YOLOv8, the parameter count decreased by 0.8M, while the computational complexity increased by 19.4G FLOPs. This seemingly counterintuitive result arises from YOLOv8’s official design adjustment to the detection head channel count, balancing resolution and channel width to optimize performance (detailed explanation in Ultralytics’ GitHub discussion). The newly added MCA and P-C2f modules increased parameters by only 0.00003 M and 0.16 M, respectively, with FLOPs increments of 0.1 G and 0.7 G, while the WIOU loss function introduced no additional computational overhead as it involves no structural modifications. Ultimately, PMW-YOLOv8 reduced parameters by 0.64 M and increased FLOPs by 20.2 G compared to the baseline, achieving a balanced optimization of accuracy and efficiency.
4.4.2. Model Performance with Different Attention Modules
To fully validate the effectiveness and superiority of the MCA module in model optimization, we selected several classic attention modules, including GAM [53], CA [54], CBAM [55], and ECA [56], and conducted comparative experiments against the MCA module. The detection results are shown in Table 3. The data show that the model incorporating the MCA module achieved optimal performance across key metrics such as precision and mAP0.5. Specifically, compared to the baseline model, introducing the MCA module improved precision by 0.3%. Further analysis revealed that, in tasks targeting specific defect categories (such as cotter pin loosening and cotter pin missing), the MCA module produced significant gains: the AP for cotter pin loosening defects improved by 3.8%, and the AP for cotter pin missing defects increased by 1.4%. This indicates that the MCA module has strong optimization capability for subtle yet important detection tasks. Moreover, the model’s overall mAP0.5 improved by 1.5%. This comprehensive performance boost fully validates the core value and excellent effect of the MCA module in model optimization.
4.4.3. Validation of the Effectiveness of the MCA Module Insertion Position
To deeply explore the rationale behind the optimal insertion positions of the MCA module in the model, we carefully designed five sets of control experiments (see Table 4), labeled groups a, b, c, d, and e, each representing a different application strategy for the MCA module; group d is the solution ultimately adopted in this study. In Table 4, numbers 1, 2, 3, and 4 correspond to the four potential MCA module insertion points marked in Figure 4, while the “√” symbol indicates that the MCA module is introduced at that position. Specifically, the solution adopted in this paper, group d, integrates the MCA module at all four positions shown in Figure 4. On the one hand, the enhanced features are directly fed into the corresponding detection modules for precise classification and localization predictions on multi-resolution feature maps; on the other hand, these features undergo further downsampling and are fused with deeper features, which are then processed by the deeper detection heads to perform the detection tasks, thereby achieving comprehensive feature enhancement for the detection heads and completing the overall object detection task. The experimental results are summarized in Table 5.
Through comparative analysis, we observe that, compared to groups a, b, and c, the model using the d solution achieved significant advantages across several key performance metrics. These metrics include overall accuracy, mAP0.5 for cotter pin loosening and missing defects, overall mAP0.5, and the more comprehensive mAP0.5:0.95 range evaluation. Group d demonstrated optimal performance in all of these areas. This confirms the rationality of the selected MCA module insertion positions in the model and further demonstrates the effectiveness of this solution in enhancing detection accuracy and robustness.
4.4.4. A Comparison of the Performance of Different IOU Loss Functions in a Model
To comprehensively validate the effectiveness and superiority of the WIOU loss function in optimizing YOLOv8 for object detection, we conducted comparative experiments applying classical bounding box loss functions, GIOU [57], DIOU [58], CIOU, Inner-IOU [59], and NWD [60], to the YOLOv8 model. As shown in Table 6, the YOLOv8 model with WIOU achieved optimal performance across all critical metrics. Precision-recall trade-off: WIOU attained a precision (P) of 78.3% and recall (R) of 57.9%, with a 0.2% improvement in precision over the CIOU baseline. Defect-specific performance: for the loosening category (cotter pin loosening), WIOU achieved an AP of 45.6%, surpassing CIOU by 4.2% and significantly outperforming NWD (43.4%); for the missing category (cotter pin missing), WIOU reached an AP of 59.9%, exceeding CIOU by 1% and NWD (59.1%).
Overall detection capability: the model achieved an mAP0.5 of 64.63%, representing a 1.3% improvement over the CIOU baseline and surpassing all other IOU variants, while its mAP0.5:0.95 reached 38.7%, the best among the compared loss functions. These results validate that WIOU enhances the model’s performance in cotter pin defect detection tasks by refining the gradient allocation strategy during training, particularly addressing challenges in industrial defect recognition.
4.4.5. Comparison of Heatmaps Before and After Model Improvement
We generated heatmaps for several power inspection images containing cotter pin targets using both the PMW-YOLOv8 and YOLOv8 models, as shown in Figure 9. These heatmaps visually show the models’ focus areas during the detection process. The improved network focuses more closely on the targets than YOLOv8, with less attention given to irrelevant areas, resulting in a significant improvement in the detection of cotter pin targets.
4.5. Comparison Experiment
To comprehensively evaluate the performance of the PMW-YOLOv8 algorithm in cotter pin defect detection tasks, we selected a range of classic and currently advanced object detection algorithms, including, but not limited to, the classic YOLO [61] series networks and the two-stage algorithm Faster-RCNN [24], among others.
The models are evaluated by comparing the precision of each category and the overall mean average precision. The experimental results are shown in Table 7. The results show that, compared with the YOLOv5 series models, the Faster-RCNN [24] model, the YOLOX model, the YOLOv7 [62] model, and the latest YOLO series models such as YOLOv10 [63] and YOLO11, PMW-YOLOv8 shows relatively better performance in precision (P), recall (R), mAP0.5, and other indicators for the different types of cotter pin defects. Specifically, PMW-YOLOv8 achieves the best performance among the compared detection models in P, R, and AP for loosening and missing cotter pin defects, as well as in overall mAP0.5 and mAP0.5:0.95.
It is worth mentioning that in the key indicator, mAP0.5, PMW-YOLOv8 outperforms the original YOLOv8 model by 3.0%, and even when compared with YOLOv10, the best-performing model among all comparison models, it still achieves a 1.6% improvement. Moreover, due to the addition of extra attention modules, our model sacrifices some detection speed, but compared to two-stage models, our model still shows certain advantages in detection performance. Therefore, the proposed PMW-YOLOv8 model achieves good detection performance in the cotter pin defect dataset, achieving a mAP0.5 of 66.3%, which is the highest among all compared models, and a precision of 80.1%, creating favorable conditions for intelligent inspection of cotter pin defects in power line patrols.
Through the analysis of metrics across the different models, we observe the following: YOLO11 achieves the highest recall (60.8%) but lower precision (76.4%) than YOLOv5s (precision: 79.6%, recall: 55.6%). This indicates that YOLO11 prioritizes detecting potential defects (e.g., “missing”, with AP: 59.5%) through relaxed confidence thresholds or broader feature extraction, while YOLOv5s emphasizes reducing false positives (e.g., “normal” class AP: 87.3%) via stricter localization strategies. TPH-YOLOv5 sacrifices both precision (78.5%) and recall (54.9%) in pursuit of real-time performance (FPS: 42.2), reflecting the inherent speed-accuracy trade-off in lightweight designs. Our method balances precision (78.1%), recall (58.1%), and speed (FPS: 74.3) by introducing a dedicated small-object detection head and optimizing the feature extraction and enhancement mechanisms.
4.6. Detection Results
In addition, we present the detection performance of our proposed model alongside other models in the following figures, using representative images to compare the detection results of different models. Since the cotter pin targets are relatively small, displaying the full images at their original size might fail to show the cotter pins clearly; we therefore display only the regions containing cotter pin targets cropped from the high-resolution images, as shown in Figure 10 and Figure 11 (ground truth). In addition, because the targets are small, we use different colors for the detection boxes in the result displays to avoid label text overlapping the targets. Specifically, red boxes mark cotter pin targets detected as normal, green boxes mark loosening targets, and blue boxes mark missing targets.
From the detection results, we can observe the following: in Figure 10, the background is relatively simple and monotonous, containing 8 normal cotter pin targets. Our PMW-YOLOv8 model, like most models, correctly detected all the targets. In contrast, the Faster-RCNN model not only produced multiple false detections but also misclassified one normal cotter pin target as a loosening target.
In Figure 11, the complex structure of the transmission tower and the interwoven vegetation form a complex background; in particular, the metal structure of the tower is very close in color to the cotter pin, making detection more challenging. Compared to the scene in Figure 10, the targets in Figure 11 are also relatively smaller, further increasing the difficulty of detection. This scene contains 14 cotter pin targets: 13 regular targets and 1 missing target. Despite these challenges, our model performed well in this scenario, accurately detecting 13 of the 14 cotter pin targets with no misclassifications. Other models, such as YOLOX, Faster-RCNN, and RT-DETR, exhibited missed detections, while models such as YOLOv7, YOLOv8, and YOLO11 exhibited false detections.
Due to the relatively rare occurrence of cotter pin loosening and missing defects, there are typically only one or a few targets in each inspection image, which makes comparisons based on individual images less reliable. To improve the accuracy of the experiment, we selected multiple images containing scenes of cotter pin loosening and missing defects. We combined the detection results from each model by slicing and stitching the images. This increases the sample size, enabling a more comprehensive comparison of the detection performance of each model.
Figure 12 shows four scenes of cotter pin loosening, each containing one loosening target and two regular cotter pin targets. Our PMW-YOLOv8 model correctly detected all the samples, while the other models did not perform as well. In the first scene, TPH-YOLOv5, YOLOv10, and YOLO11 misclassified the loosening cotter pin as a regular cotter pin. In the second scene, TPH-YOLOv5, YOLOX, and YOLOv8 also misclassified the loosening cotter pin as a regular cotter pin, while YOLOv7 misclassified a regular bolt as a missing cotter pin. In the third scene, YOLOv5, YOLOv8, and YOLOv10 all misclassified the loosening cotter pin as a regular cotter pin, while TPH-YOLOv5 and YOLOv7 misclassified a regular bolt as a missing cotter pin. In the fourth scene, YOLOv5m and TPH-YOLOv5 likewise exhibited false detections.
Figure 13 presents four detection scenes of cotter pin missing defects, including four missing cotter pin targets and nine normal cotter pin targets. Compared to other models, our PMW-YOLOv8 model achieved the best detection performance, correctly identifying all targets in the four scenes and demonstrating excellent recall and accuracy in detecting cotter pin defects. In the first scene, the Faster-RCNN model missed one missing cotter pin target. In the second scene, the YOLOv5s and TPH-YOLOv5 models missed one missing cotter pin target, while YOLOv7 and Faster-RCNN misclassified one missing cotter pin target as a normal cotter pin target. In the third scene, models such as YOLOv5, YOLOv8, YOLOX, YOLOv10, and YOLO11 missed one normal cotter pin target, while YOLOv7 misclassified one normal cotter pin target as a missing cotter pin target. In the fourth scene, YOLOv5, YOLOX, YOLOv10, and RT-DETR models missed one normal cotter pin target.
To demonstrate the generalization capability of the PMW-YOLOv8 model, we selected an image for which both YOLOv8 and PMW-YOLOv8 produced identical detection results under normal weather conditions, as shown in Figure 14. Both models detected seven cotter pin targets in clear weather. By simulating snow, rain, thunderstorm, and dense fog conditions, we compared the changes in the two models’ detection results. Under simulated rain and thunderstorm conditions, both models produced identical results, successfully identifying all cotter pins. However, in snowy conditions, YOLOv8 detected six cotter pins with one missed detection, while PMW-YOLOv8 detected all targets. In dense fog, both models missed detections: YOLOv8 detected only three cotter pins, whereas PMW-YOLOv8 detected five. These results demonstrate the superior generalization performance of PMW-YOLOv8 under diverse weather conditions.
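The text does not state which tool generated these weather simulations; as one possibility, the albumentations library provides comparable effects, sketched below with a placeholder image path.

```python
# Illustrative weather simulation (rain, snow, fog) applied to an inspection image.
import albumentations as A
import cv2

img = cv2.imread("inspection_image.jpg")  # placeholder path

weather = {
    "rain": A.RandomRain(p=1.0),   # streak overlay approximating rain / thunderstorm
    "snow": A.RandomSnow(p=1.0),   # brightened snow-like speckle
    "fog":  A.RandomFog(p=1.0),    # low-contrast haze approximating dense fog
}
simulated = {name: aug(image=img)["image"] for name, aug in weather.items()}
```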
5. Conclusions
This paper addresses the small-object detection problem faced by existing cotter pin detection methods by segmenting high-resolution images into several sub-images during the model training phase. This increases the relative size of cotter pin targets in the images, improving detection accuracy. In the PMW-YOLOv8 model, a small-object detection head was added, which fully utilizes the edge information and shallow features of small objects to enhance detection precision.
Additionally, we integrated an MCA module into each of the four target detection heads (P2–P5), further enhancing the extracted features. This integration improves the model’s ability to detect and classify small objects. To address the fine-grained issues in cotter pin detection, we introduced a polarized self-attention mechanism (PSA) on top of the C2f module, proposing a P-C2f module. This module is incorporated into the model to strengthen the extraction and processing of fine-grained information, thereby enhancing detection accuracy for fine-grained cotter pin targets.
The proposed PMW-YOLOv8 was tested on the cotter pin defect dataset, and the experimental results show that PMW-YOLOv8 outperforms the native YOLOv8 algorithm in key metrics such as AP and mAP0.5 when detecting common defects such as loose and missing cotter pins. Moreover, compared to other classic detection algorithms, PMW-YOLOv8 achieves satisfactory results in precision, AP, mAP0.5, and mAP0.5:0.95.
Overall, PMW-YOLOv8 demonstrates better detection performance and effectively improves the detection of cotter pin defects in power transmission line inspections.
Although the proposed PMW-YOLOv8 demonstrates improved performance in cotter pin defect detection for power transmission lines, several challenges remain to be addressed in future research:
(1) Sensitivity to adverse weather conditions: Current methods may suffer from performance degradation due to image quality deterioration caused by rain, snow, or fog. To enhance model robustness in harsh weather scenarios, we will explore weather-invariant feature learning through generative adversarial training and physics-based image restoration techniques.
(2) Edge deployment constraints: The increased model complexity (parameters and computations) for achieving high detection accuracy could limit real-time processing on edge devices such as UAVs and mobile platforms. Future work will focus on lightweight network architecture design and model compression techniques (e.g., pruning and quantization) while maintaining competitive performance, thereby enabling efficient deployment on resource-constrained devices.