Detection of Cotter Pin Defects in Transmission Lines Based on Improved YOLOv8

Wang, Peng; Yuan, Guowu; Zhang, Zhiqin; Rao, Junlin; Ma, Yi; Zhou, Hao

doi:10.3390/electronics14071360

Open AccessArticle

Detection of Cotter Pin Defects in Transmission Lines Based on Improved YOLOv8

by

Peng Wang

¹,

Guowu Yuan

¹

,

Zhiqin Zhang

¹,

Junlin Rao

¹,

Yi Ma

² and

Hao Zhou

^1,*

¹

School of Information Science and Engineering, Yunnan University, Kunming 650504, China

²

Yunnan Power Grid Co., Ltd., Kunming 650011, China

^*

Author to whom correspondence should be addressed.

Electronics 2025, 14(7), 1360; https://doi.org/10.3390/electronics14071360

Submission received: 29 January 2025 / Revised: 20 March 2025 / Accepted: 26 March 2025 / Published: 28 March 2025

Download

Browse Figures

Versions Notes

Abstract

:

The cotter pin is a critical component in power transmission lines, as it prevents the loosening or detachment of nuts at essential locations. Therefore, detecting defects in cotter pins is vital for monitoring and diagnosing faults in power transmission systems. Due to environmental factors and human errors, cotter pins are susceptible to loosening and becoming missing. In split pin detection, the primary challenges lie in the small size of the target features and the fine-grained issue of “small inter-class differences and large intra-class variations”. This paper aims to enhance the detection performance of the model for fine-grained small targets by adding a detection head specifically designed for small objects and embedding an attention mechanism. This paper addresses the detection of looseness and missing defects in cotter pins by proposing a target detection model called PMW-YOLOv8 (P-C2f + MCA + WIOU) based on the YOLOv8 framework. The model introduces a specialized small-target detection head (160 × 160), which forms a four-scale pyramid (P2–P5) through cross-layer aggregation, effectively utilizing shallow features. Additionally, it incorporates a multidimensional collaborative attention (MCA) module to enhance the features transmitted to the detection head. To further address the fine-grained feature extraction problem, a polarization self-attention mechanism is integrated into C2f, leading to the proposed P-C2f module. Finally, the WIOU loss function is applied to the model to mitigate the impact of sample quality fluctuations on training. Experiments were conducted on a cotter pin defect dataset to validate the model’s effectiveness, achieving a detection accuracy of 66.3%, an improvement of 3% over YOLOv8. The experimental results demonstrate that our model exhibits strong robustness and generalization, enabling it to extract more profound and comprehensive features.

Keywords:

YOLOv8; cotter pin defects; small-target detection; attention mechanism

1. Introduction

The cotter pin is a critical stabilizing component in power transmission lines, playing a vital role in maintaining the structural integrity of the transmission system. Given power transmission line’s complex and harsh operational environment, cotter pins are exposed to prolonged stress, making them susceptible to loosening and detachment. This risk poses a threat to the safety and stability of the transmission system, emphasizing the need for timely detection of cotter pin defects. Therefore, prompt identification of any defects in cotter pin is particularly important.

However, detecting cotter pins in power transmission lines presents unique challenges compared to general object detection. Figure 1 shows a typical inspection image containing cotter pins. It is evident that in regular inspection images, cotter pins are usually not the focal point, making them extremely small and difficult to observe with the naked eye. A power inspection image containing two cotter pins with their positions marked by red bounding boxes is shown in Figure 1. The original image dimensions are 2000 × 1125 pixels, while the annotated regions for the two cotter pins measure only 80 × 50 pixels and 75 × 75 pixels, occupying approximately 0.18% and 0.25% of the total image area, respectively, both well below the established threshold for small-object detection (typically <1% of image area). Therefore, the primary challenge in cotter pin defect detection is the small target size. Figure 2a–c represents close-up slices of the regular, loose, and missing cotter pin forms, respectively. The differences between these three forms are pretty subtle. However, in inspection images taken from different angles and distances, the difference between Figure 2a and Figure 2c (representing different cotter pin forms) is slight. In contrast, the difference between 1 and 2 (representing the same cotter pin form) in image a is quite significant. Another challenge in cotter pin defect detection is the fine-grained problem of “small inter-class differences and large intra-class differences”. Additional factors include the diversity of inspection environments, the complexity of power facilities, changes in weather and lighting conditions, interference from human and natural factors, and limitations of image capture devices. This results in the background of power inspection images being generally complex, further increasing the difficulty of detecting missing cotter pin defects.

In traditional power line inspections, the images captured by drones contain a large number of cotter pins, which presents significant challenges in detection, resulting in high detection costs, low efficiency, and limited accuracy. Inspectors must manually examine each cotter pin one by one, which consumes considerable time and effort. This increases operational and maintenance costs and introduces the risk of missed or false detections due to human error. Therefore, integrating drone image acquisition with deep learning holds critical significance for industrial inspections [1]. By leveraging deep learning techniques, we can develop models capable of automatically identifying cotter pin defects by leveraging deep learning techniques. These models can automatically analyze image or video data from power transmission lines, quickly and accurately detecting defective cotter pins. This not only reduces the workload of manual inspections and lowers operational costs, but also improves detection accuracy, ensuring the safety and stability of power transmission.

Research on cotter pin defect detection is relatively scarce and incomplete, both domestically and internationally, and it also has certain limitations. Early detection of power transmission line fittings and their defects primarily relied on manually designed features, resulting in low detection efficiency and accuracy. With the development of deep learning, which excels in feature extraction, processing, and localization in image detection, deep learning has become the mainstream method for cotter pin defect detection. Based on the detection process, detection algorithms can be divided into single-stage and multi-stage cascade detection algorithms.

(1) The single-stage detection algorithm, as illustrated in the process of Figure 3, performs object localization and classification directly on the image, outputting the bounding box coordinates and class probabilities of the objects to produce the detection results. Gong et al. [2] proposed a deep learning model based on an improved RetinaNet, named DDNet, for detecting defects in small cotter pins in electric power transmission systems from UAV images. This method enhances feature extraction capabilities by introducing ResNeSt50 as the backbone network and combining the feature pyramid network (FPN) with the receptive field block (RFB) to improve the detection accuracy of small targets. Experimental results demonstrate that the model performs well in the task of cotter pin defect detection. Based on the YOLOv5 algorithm, Yang et al. [3] optimized the network structure and loss function, significantly improving the speed and accuracy of cotter pin defect detection, thereby providing effective technical support for intelligent inspection robots. However, since single-stage object detection extracts target features directly from the raw image, it is susceptible to interference from background noise, resulting in lower detection accuracy.

(2) Multi-stage cascade detection algorithms typically consist of two stages, as shown in the multi-stage object detection flowchart in Figure 3. First, a region proposal network generates candidate regions containing the target. Then, a classification and regression network performs precise localization and classification of the target within these candidate regions. This approach is generally considered to provide higher accuracy. Li et al. [4] introduced a lightweight detection method, CSSAdet (Cross-Scale Spatial Attention Detector), which combines spatial and cross-scale attention mechanisms. The method first detects component connection points in images captured by drones and then identifies the status of the cotter pins at these points. Li et al. [5] proposed a two-level cascade detector, where the first level locates the cotter pin in the image, crops a region around it, and then sends this region to a second-level detector for fine-grained status recognition. Fang et al. [6] developed a cascade network for detecting missing cotter pins and anti-vibration hammers. The network first identifies the target region and then extracts more robust features of the cotter pin, improving detection accuracy. Experimental results show that this two-level network significantly improves over non-cascade networks, especially in terms of accuracy for missing cotter pins. However, these algorithms are susceptible to interference from background information due to the specific challenges associated with detecting cotter pin defects, such as small target sizes and fine-grained issues. As a result, important cotter pin features may be overlooked during the candidate region generation phase, reducing the effectiveness of the two-stage network. Therefore, in subsequent experiments, this method did not demonstrate good performance. Additionally, due to the layered nature of the multi-stage process, the computational complexity is high, leading to reduced detection speed.

In summary, current cotter pin defect detection algorithms mainly focus on single-stage and multi-stage cascade detection. Single-stage detection algorithms offer high detection speed but relatively low detection accuracy. In cotter pin defect detection, multi-stage detection methods face challenges because the cotter pin target occupies a tiny portion of the image, with most of the area being background information. Additionally, other parts of the power tower components are similar in color and shape to the cotter pin, and each stage of the multi-stage detection model is susceptible to interference, resulting in lower detection accuracy and poor applicability. Therefore, improving algorithms for cotter pin defect detection is of great significance for addressing the issue of the cotter pins’ small size. Mainstream solutions primarily revolve around technical approaches such as multi-scale feature fusion or global context modeling. Feature pyramid-based methods (e.g., PFN [7]) enhance the representation of tiny objects through cross-layer feature aggregation, while transformer-based models leverage self-attention mechanisms to capture long-range dependencies, yet suffer from high computational complexity and sensitivity to local details [8]. As shown in Figure 3, this paper performs sub-image segmentation on the dataset to increase the relative size of target regions (bounding boxes) within the image, thereby reducing interference from excessive background information for cotter pin defect detection. In terms of the algorithmic model, this paper proposes adding detection heads that focus on shallow feature information, fusing multi-scale feature information, and employing attention mechanisms to prioritize important detail features. It also suggests improving the loss function to better accommodate the characteristics of small-object detection, thereby enhancing detection performance for cotter pins.

The main contributions of this paper are as follows:

(1): P-C2f module. This paper combines the C2f module with the PSA (polarized self-attention) module to design a new module named P-C2f. Based on the C2f module, the P-C2f module integrates vertically polarized self-attention, enabling high-quality regression performance and reducing the impact of fine-grained issues in cotter pin defect detection on the model’s performance.
(2): Introduction of the MCA module before the detection head. This paper introduces the MCA (multidimensional collaborative attention) module before the model’s detection head. This module allows the network bottleneck layer to output image feature information and calculate the weights of different channels for the original feature map, enhancing the network’s attention to small targets of the cotter pin.
(3): Improvement of the loss function. In this paper, the boundary box loss function of YOLOv8 is replaced with a more appropriate WIOU (wise-IOU) loss function to reduce the impact of sample quality fluctuations on the model.

Based on the three key modules introduced in YOLOv8, we have named our model PMW-YOLOv8.

This paper is divided into five sections. The first section describes the background of research, current state, and cotter pin defect detection challenges. The second section discusses related works on cotter pin detection, summarizing the achievements and shortcomings of these works. The third section introduces the structure of the proposed PMW-YOLOv8 framework and the improved modules. The fourth section provides a detailed analysis of the experimental results, demonstrating the effectiveness and superiority of the proposed model. The fifth section summarizes the work in this paper and discusses future research directions.

2. Related Work

2.1. Small Object Detection

In the MS-COCO [9] metric evaluation, objects with an area no more significant than 32 × 32 pixels are classified as small objects. Conventional object detectors have gradually matured in detecting medium and large objects with the development of object detection technology. However, conventional detectors that rely on abundant visual features are not well-suited for detection of small objects due to their small pixel size and limited visual features. With the development and continuous improvement of the object detection field, many methods and models for small object detection have been proposed, leading to continuous improvements in detection accuracy and efficiency.

Multi-scale feature fusion is an effective method for small-object detection. In many classic object detection algorithms, shallow features lack semantic information, while deep features cause a significant loss of small-object location and feature information, making detection more challenging. As a result, image pyramids were introduced. Initially, researchers extracted features and performed object detection on images of different scales to detect objects of various sizes, and small-object detection benefited from this. However, this method leads to a sharp increase in storage and computation requirements. SSD [10] was proposed to address this issue. By extracting features from images at different scales, shallow features are used to detect small objects [11]. In contrast, deep features detect larger objects, laying the foundation for subsequent small-object detection. Later, the ION (inside-outside network) [12] object detection method was introduced, which fuses important feature information from different scale feature maps to obtain the final feature information, partially addressing the issue of small-object detection being easily affected by background noise. Feature pyramids connect feature information from bottom-up and top-down in a lateral manner, adding minimal computation and fully utilizing the semantic information of features at different scales. Feature pyramids have also become a mainstream feature extraction method.

In addition to improving the detection model, data augmentation of the dataset is also an important method for enhancing small-object detection. Early object detection techniques such as deformation [13], rotation, cropping, and scaling [14] were already used as important data augmentation methods. With the development of small-object detection, more data augmentation methods have been proposed. The copy-paste strategy [15] increases the number of small objects by copying and pasting small objects, thereby improving small-object detection performance. Scale matching methods [16], image stitching [17], and other data augmentation techniques have also been shown to effectively improve small-object detection accuracy. However, although these methods can increase the number of detected objects, the feature information of the objects remains unchanged, thus offering limited improvement in small-object detection performance.

Background is an important element in object detection, occupying a large portion of the input image, and it may interfere with the recognition of target objects. On the other hand, in natural scenes, background information can also serve as helpful information to detect small objects with less feature information, i.e., small-object detection based on contextual information. In facial detection research, contextual information plays a key role by exploring the relationships between targets within the target domain and the relationships between the target and the overall image, improving the detection accuracy of small objects in the COCO dataset [18,19]. Cai [20] et al. proposed a multi-scale candidate region extraction network that incorporates contextual information to learn feature information at different scales, aiming to improve the detection efficiency of small objects.

Multi-scale feature fusion and dataset augmentation can improve the detection efficiency of small objects to some extent. However, for open pin targets, their relative size in inspection images is relatively small, and the feature scale is quite uniform, so the improvement in the detection performance of open pin defects is limited. Similarly, data augmentation provides limited help in detecting open pin defects. Therefore, relying solely on data augmentation to improve the detection efficiency of open pin defects is insufficient. Background information is an important part of many object detection tasks, but for open pins, due to their tiny relative size, the majority of the area is occupied by background information, and excessive attention to background information may reduce the detection efficiency of cotter pins.

2.2. Fine-Grained Object Detection

Fine-grained recognition aims to distinguish between multiple subcategories that belong to the same essential category but exhibit visual similarities, such as animal species, car models, etc. [21]. Due to their high similarity and few distinguishing features, fine-grained classification and recognition are highly challenging. After years of development, fine-grained recognition has led to various solutions from different perspectives.

Based on recognition using localization and classification sub-networks, in this method, researchers construct a localization sub-network that can locate key areas and extract local features for classification. Among these, using segmentation or detection techniques to locate key regions in the image is an important method [22,23,24], and this approach initially relied on additional bounding boxes and local key points [25,26,27,28] to extract local features for fine-grained recognition. However, this approach involves significant work and is unsuitable for large-scale classification tasks. Therefore, weakly supervised recognition methods using only image-level labels have become mainstream [29,30,31,32,33]. In addition to using detection or segmentation techniques, using filters [34,35,36] or attention mechanisms [37,38] is an important approach in classification sub-networks.

End-to-end feature encoding-based recognition is also an important paradigm for fine-grained recognition. Feature learning is a crucial part of computer vision tasks, where deep convolutional features are widely used because they contain deeper convolutional features, and local convolutions have been shown to have significant effects in fine-grained tasks [39,40]. The loss function is a critical step in building neural network models, and designing appropriate loss functions for fine-grained recognition tasks is also an important means to improve fine-grained recognition performance [41,42].

In addition to relying solely on images for fine-grained recognition, some researchers leverage information beyond the images to assist in accomplishing tasks. For instance, researchers integrate multimodal information, such as text, to establish joint representation features, thereby improving recognition accuracy [43,44,45,46]. Furthermore, in addition to multimodal information, some studies have proposed seeking human assistance during the iterative training process of neural network models. The model’s learning process is supported through human-computer collaboration, enabling the model to understand how humans perform fine-grained recognition [47,48].

Recognition based on the localization and classification sub-network is beneficial for detecting the defects of cotter pins. However, this approach undoubtedly increases the workload and time cost of various stages, such as dataset creation, model training, and detection. Additionally, misdetections in key regions can directly accumulate into the second stage of target detection, limiting detection accuracy. Similarly, using auxiliary information for detection also increases the workload of the detection task to some extent. The relevant auxiliary information is limited and lacks a clear standard to follow, which may hinder the completion of the detection task.

3. PMW-YOLOv8 Network Model Structure

YOLOv8 achieves a balance between real-time performance and detection accuracy through its Cross-Stage Partial Network (CSPNet) and dynamic detection head design, which aligns with the dual requirements of rapid response and precise recognition in power inspection scenarios. Experimental results (as shown in Table 7) demonstrate that this model improves mean average precision (mAP) by 3.2% compared to YOLOv5, YOLOv7, and RT-DETR on our custom dataset. Furthermore, even the enhanced frameworks targeting the latest versions (PMW-YOLOv10 and PMW-YOLOv11) failed to surpass the performance of PMW-YOLOv8, further validating YOLOv8 as the optimal baseline.

In response to the challenges posed by the large number of cotter pins, small target sizes, and the fine-grained feature differences between different categories of targets, which make detection difficult, this paper improves upon the YOLOv8 architecture and proposes a new PMW-YOLOv8 model.

The structure of the PMW-YOLOv8 model is illustrated in Figure 4. The proposed PMW-YOLOv8 feature extraction and detection process is as follows: First, the input power inspection images are preprocessed by resizing them to 640 × 640 × 3, ensuring uniformity in size when input into the neural network. After preprocessing, the images are passed into the backbone network, where the CBS layer, C2f layer, and SPFF layer work together to extract image features. As the layers of the backbone network increase, more profound levels of image features are extracted. Subsequently, the extracted features from different layers are concatenated in the neck network to achieve feature fusion, enhancing the model’s detection performance for targets of various sizes. Finally, the fused features are enhanced through the MCA module [49] and P-C2f module, then sent to the decoupled detection heads (detect) for box detection and classification (cls) recognition. Each detection head produces feature maps for both classification and regression scales. The detection boxes are obtained after the feature maps from the four detection heads are concatenated. The boxes are filtered and mapped to the original image to produce the final detection results.

As shown in Figure 4, in order to improve the model’s detection ability for small targets, the improved model adds a small-target detection head, P2, based on the YOLOv8 structure. It receives larger, shallower feature maps, allowing the model to leverage shallow image features where the contours of small targets are preserved, thereby improving the detection accuracy for small targets. Additionally, we designed a new P-C2f module to replace the original C2f module. The P-C2f module incorporates a PSA (polarized self-attention) mechanism [50], effectively preventing information loss while selectively enhancing and fine-tuning feature information. This enhancement enables the model to tackle complex tasks with improved accuracy, such as cotter pin defect detection. In addition, we incorporated the MCA module into the added small-target detection head (P2), sending the feature information extracted by the P-C2f module into the MCA module. This module can fully capture the correlations between different dimensions, calibrate the attention weights generated across dimensions, and extract more refined features, improving the detection and classification accuracy for small cotter pins.

3.1. A Small Target Detection Layer

YOLOv8 is designed with three layers of feature maps, utilizing three detection heads (P3–P5) of different sizes for detecting targets of various sizes. However, due to the small size of the cotter pin targets on power transmission lines, the deep features obtained through continuous downsampling contain fewer features related to these small targets, which reduces the algorithm’s detection accuracy for the cotter pin targets.

Therefore, in PMW-YOLOv8, based on the original YOLOv8 model architecture, we added a new small-target detection head (P2) and the corresponding detection layer, as shown in the blue-shaded section of Figure 4. After the original model’s C2f module outputs a feature map of size 80 × 80 × 256, an upsampling operation is applied to enlarge the feature map size, resulting in an output feature map of size 160 × 160 × 128. This feature map is fused with shallow features from the backbone network through a skip connection, generating a larger feature map containing richer small target feature information. Compared with the original three detection layers of YOLOv8, the newly added small-target detection layer structure cleverly fuses the shallower raw features of the image with the features extracted and enhanced by the network’s deeper layers. Thus, the resulting features contain more shallow detail information, significantly improving the detection accuracy for small targets such as cotter pins and their defects, partially compensating for the model’s performance shortcomings in small-target detection.

3.2. P-C2f: Integrating PSA into C2f

Due to the diversity of cotter pin targets in inspection images, they exhibit high complexity in size, shape, position, and angle. This leads to significant feature similarities among the three states—regular, loose, and missing. For example, in Figure 2a,c belong to different categories but display high feature similarity. Conversely, the upper and lower subfigures in Figure 2a belong to the same category but exhibit low feature similarity in image characteristics. Such situations reduce the model’s classification accuracy across different categories.

Although the C2f feature fusion module in YOLOv8 enables the model to capture contextual information and high-resolution details simultaneously, it treats features of different scales and levels equally, maintaining relatively similar attention across these features. For cotter pin defect detection, most features appear on a small scale and exhibit fine-grained characteristics. This requires great emphasis on shallow features to capture the fine-grained details of the target. Therefore, this paper introduces the polarized self-attention (PSA) mechanism [46] into the C2f module and proposes a new P-C2f module.

A^{c h} (X) = F_{S G} [W_{z} ((σ_{1} (W_{v} (X)) \times F_{S M} (σ_{2} (W_{q} (X))))]

(1)

The computation process of the spatial self-attention branch is represented as follows:

A^{s p} (X) = F_{S G} [σ_{3} (F_{S M} (σ_{1} (F_{G P} (W_{q} (X)))) \times σ_{2} (W_{v} (X)))]

(2)

Here,

W z

,

W v

and

W_{q}

are standard 1 × 1 convolutions (Conv),

θ_{1}

is intermediate parameters of the convolution.

σ_{1}

and

σ_{2}

are matrix reshaping operators, which transform matrices (or three-dimensional arrays) of shape 1 × H × W and C/2 × H × W into matrices of shape HW × 1 × 1 and C/2 × 1, respectively.

F_{S G}

and

F_{S M}

are the Sigmoid and Softmax operators,

F_{G P}

is the global pooling operator, and “

\times

” represents the dot product operation of matrices. The two modules are arranged in a series to form a complete PSA module. The operation of the complete PSA module is shown in Equation (3), where the processed feature vector from the channel attention branch is used as input to the spatial attention branch.

P S A_{s} (X) = A^{s p} (A^{c h})

(3)

In this paper, by combining PSA with C2f, a new P-C2f module is proposed to replace part of the C2f module in YOLOv8, enhancing the model’s feature representation ability and improving detection accuracy. The structure of the P-C2f module is shown in Figure 5. C2f is a two-branch structure, where, after generating an intermediate feature map of size H × W × C through the CBS (Conv2d + batchNorm2d + Silu) module, the feature map is split into two parts of size H × W × 1/2C using Split. One branch directly outputs to the final Concat block. In contrast, the other branch passes through multiple Bottleneck blocks for further processing before being sent to the Concat block for feature fusion. The processing in the Bottleneck module involves extracting deeper image features, during which the Bottleneck module processes features layer by layer to extract deeper-level features. By adding PSA to the Bottleneck, PSA can participate in this entire feature layer-by-layer processing. The Softmax function in PSA transforms the input feature values into a probability distribution, highlighting larger feature values and suppressing smaller ones (less important channel features). It adjusts the weights of each channel, enabling the model to automatically focus on the most relevant channels and spatial information, highlighting key features in the global context so that important features dominate the subsequent processing, which was validated in subsequent experiments.

X_{P S A - B o t t l e n e c k} = P S A_{s} (W_{3 \times 3} (W_{3 \times 3} (X))) + X

(4)

The expression for PSA-Bottleneck is given by Equation (4), where

X

is the input to PSA-Bottleneck,

X_{P S A - B o t t l e n e c k}

is its output, and

W_{3 \times 3}

is a 3 × 3 convolution. The expression for P-C2f with PSA-Bottleneck replacing Bottleneck is given by Equation (5):

Y = W_{3 \times 3} (C o n c a t (X_{S p l i t} (W_{1 \times 1} (X)), X_{P S A - B o t t l e n e c k}^{n} (X_{S p l i t} (W_{1 \times 1} (X)))))

(5)

where

Y \in ℝ^{h \times w \times c}

is the output of P-C2f,

X \in ℝ^{h \times w \times c}

is the input to C2f,

W_{1 \times 1}

is a 1 × 1 convolution,

X_{S p l i t}

is the part of the features after splitting, directly passed to the Concat block, and

X_{P S A - B o t t l e n e c k}^{n}

represents the stacking of n PSA-Bottleneck blocks.

The PSA mechanism adopts a dual-branch polarized design (Equations (1) and (2)). In the channel dimension, it achieves a “winner-takes-all” weight distribution through Softmax, while preserving multi-region responses via Sigmoid in the spatial dimension. This characteristic simultaneously enhances critical channels and spatial positions, effectively addressing the fine-grained problem of “large intra-class variations and small inter-class differences” in cotter pin detection. Compared with traditional attention mechanisms like SE or CBAM, PSA reduces computational complexity from O(C²) to O(C) through matrix reshaping operations (σ₁, σ₂), making it particularly suitable for YOLO architectures requiring multiple detection heads. Experimental results demonstrate that the P-C2f module increases parameters by only 3% while improving model performance. Heatmap visualizations reveal that gradient distributions in the PSA branch are more concentrated on geometric edges of cotter pins, whereas the original YOLOv8 model with pure C2f modules exhibits dispersed gradient patterns. This evidence confirms that PSA strengthens gradient backpropagation in discriminative regions through polarized weighting.

3.3. Embedding MCA into the Model

To further enhance the model’s feature representation ability, as well as its detection accuracy and efficiency for small targets, this paper adds an MCA module before each detection head in the YOLOv8 model, at positions 1, 2, 3, and 4 shown in Figure 4. The features processed by the backbone network are fed into the MCA module, further enhancing the image features, thereby improving the detection and classification ability of the detection heads and reducing the model’s false detection rate.

The structure of MCA [49] is shown in Figure 6, adopting a three-branch architecture, with each branch responsible for modeling attention along the channel, width, and height dimensions. Taking the first branch as an example, the entire process of this branch can be summarized as follows:

F_{W}^{'} = P M_{H}^{- 1} (F_{S G} (P M_{H} (T_{e x} (P M_{H}^{- 1} (T_{s q} (P M_{H} (F)))))) \otimes P M_{H} (F))

(6)

F \in ℝ^{C \times H \times W}

represents the input feature to the MCA, which is sent to each branch. In the first branch,

P M_{H} (*)

indicates a 90° rotation along the H direction (permute),

P M_{H}^{- 1} (*)

indicates a 90° clockwise rotation of the feature matrix along the H direction,

T_{s q} (*)

represents average pooling and standard pooling,

T_{e x} (*)

is the activation transformation with a 1 × K convolution kernel,

F_{S G} (*)

is the Sigmoid function, and

\otimes

represents matrix multiplication. The same operation is applied in the other spatial dimension along the height direction (H) to obtain

F_{H}^{'}

. The channel dimension (C), similar to the previous two channels but with two fewer rotation operations, results in

F_{C}^{'}

. Finally, the outputs from the three branches are averaged to obtain the feature map optimized with weights across different dimensions.

PMW-YOLOv8 integrates the MCA module before each detection head to enhance the feature map transmitted to the detection head and the subsequent layers, improving both channel and spatial dimensions. This enhancement mechanism strengthens the representation of small target features, ensuring that these fine-grained details are more pronounced during subsequent processing. Specifically, we integrate the MCA module after the P-C2f and C2f modules. Before the P-C2f and C2f modules, the Concat module receives the feature map from the previous layer, while the shallow feature map passes through the skip connection. These two feature maps are then concatenated to fuse deep and shallow features. The resulting fused feature map is subsequently passed into the P-C2f and C2f modules for further feature extraction.

F_{M C A} = \frac{1}{3} (F_{W}^{'} + F_{H}^{'} + F_{C}^{'}) F_{C 2 f}

(7)

On this basis, we integrate the MCA module to enhance the feature weights of the target region in the feature maps output by the C2f module. The mechanism of MCA in PM-YOLOv8 is described by Equation (7), where the output of MCA is represented as

F_{M C A}

, the input to MCA is the output of C2f, or P-C2f, and is represented as

F_{C 2 f}

. Small targets are typically concentrated in specific channels in the channel dimension, while background channels contribute less to the feature map. The channel branch applies convolution and pooling operations to the output

F_{C 2 f}

of either P-C2f or C2f, effectively compressing the channel features. It adjusts the channel weights

w_{C}

using the Sigmoid function to increase the attention on target-related channels while reducing the weights of irrelevant channels. The adjusted weights are then added to the original feature map via residual connections, focusing on the channel features that significantly contribute to small targets, ensuring their prominence is preserved in deeper layers and ultimately improving the accuracy of small-target detection. The exact process applies to the spatial dimension.

The integrated enhanced features

F_{M C A}

follow two processing paths: on the one hand, the enhanced features are directly passed to the corresponding detection head for classification and localization prediction of feature maps with different resolutions; on the other hand, these features are downsampled further, fused with deeper layer features, and then passed to deeper detection heads for target detection. This optimization mechanism enables the model to more effectively highlight target region features while suppressing irrelevant background interference, especially demonstrating advantages in small-target detection tasks. In subsequent experiments, we confirmed the performance improvement of the model by integrating MCA through ablation experiments. Additionally, we conducted comparative experiments, validating that the integration of MCA in this model outperforms other attention mechanisms for this task. Finally, we compared different MCA integration strategies and demonstrated the validity of the MCA integration approach proposed in this paper.

3.4. WIOU Loss

The loss function of YOLOv8 is divided into classification and regression branches. The classification loss function is the BCE loss function, which helps the model correctly classify the detected objects. Its computation is represented by Equation (8), where

y_{i}

is the actual label of the sample, and

p_{i}

is the model’s predicted probability. The regression loss combines CIOU and DFL (distribution focal loss), which helps the model locate the detected objects.

D F L (S_{i}, S_{i + 1}) = - ((y_{i + 1} - y) \log (S_{i}) + (y - y_{i}) \log (S_{i + 1}))

(8)

The calculation of the DFL loss function is as follows:

S_{i}

and

S_{i + 1}

are the model’s predicted values, and nearby predicted values

y

,

y_{i}

, and

y_{i + 1}

represent the nearby label value, predicted value, and true label value, respectively.

L o s s_{t o t a l} = η L o s s_{B C E} + μ L o s s_{D F L} + γ L o s s_{W I O U}

(9)

In our training dataset, some low-quality examples inevitably occur, such as cases where the IOU between anchor boxes and target boxes is low or where the resolution of the target and background is low. The geometric factors, such as distance and aspect ratio, in the CIOU used by YOLOv8 increase the penalty for these low-quality samples while requiring significant computational resources. Therefore, we replaced the CIOU loss function used in YOLOv8 with the WIOU loss function [51], reducing attention and interference when the detection box aligns well with the target box, thereby mitigating the penalty from geometric factors. This reduces the impact of low-quality samples on the model’s generalization performance, thereby improving the model’s generalization ability. The loss function of the improved model is shown in Equation (10):

L o s s_{t o t a l} = η L o s s_{B C E} + μ L o s s_{D F L} + γ L o s s_{W I O U}

(10)

In Equation (10), the three loss functions are weighted by proportional weights, where the hyperparameters a, b, and c represent the weights of the three loss functions. In this paper, a, b, and c are set to 7.5, 0.5, and 1.5, respectively. This is the WIOU loss function, as shown in Equation (11):

L o s s_{W I O U} = r R_{W I O U} L_{I O U}

(11)

R_{W I o U} = \exp (\frac{{(x - x_{g t})}^{2} + {(y - y_{g t})}^{2}}{{(W_{g}^{2} + H_{g}^{2})}^{*}})

(12)

Parameters

(x, y)

and

(x_{g t}, y_{g t})

represent the center points of the detection box and the ground truth box, respectively, while

W_{g}

and

H_{g}

are the width and height of the union region. When the detection box fits well with the ground truth box,

L_{I O U}

becomes smaller, significantly reducing the focus on the detection box with good matching performance. Conversely, when the detection box fits poorly, aligning with the ground truth box due to the presence of hard-to-classify or hard-to-predict samples during the training process,

R_{W I o U}

becomes smaller. This encourages the model to focus on the most regular samples during training, reducing the emphasis on both high-quality and low-quality samples, thereby enhancing the model’s generalization ability. The specific expression of the parameter is given in Equation (13).

r = \frac{β}{δ α^{β - δ}}

(13)

In Equation (13),

α

and δ are hyperparameters, where

α = 1.9

and

δ = 3

in this paper.

β

is a parameter expressing the degree of matching between two boxes, which is inversely proportional to the degree of match between the boxes. The specific expression of

β

is given in Equation (14).

L_{I O U}

is the moving average of the IOU loss, which can dynamically adjust the gradient based on the current degree of matching between the two boxes and formulate the most suitable gradient gain distribution strategy for the current situation. This ensures that the gradient descent speed remains relatively level, accelerating the training process and improving detection accuracy.

β = \frac{L_{I o U}^{*}}{\bar{L_{I O U}}}

(14)

4. Analysis of Experimental Results

4.1. Experimental Environment

The experimental hardware configuration includes a 12 vCPU Intel(R) Xeon(R) Silver 4214R CPU @ 2.40GHz, 32GB RAM (Intel Corporation, Santa Clara, CA, USA), and an NVIDIA 3080TI 12GB graphics card (NVIDIA Corporation, Santa Clara, CA, USA). The training and testing were conducted using the neural network framework Pytorch 2.4.1 + cu118. Furthermore, each model configuration (including the baseline YOLOv8 and PMW-YOLOv8) was independently executed 10 times under this environment with fixed hyperparameters (imgsz = 640, batch = 8, lr0 = 0.01, lrf = 0.001), and the reported results are the arithmetic means of these repeated experiments.

4.2. Evaluation Metrics

To assess the performance of the proposed PMW-YOLOv8 improved model, this paper uses three key evaluation metrics: precision (P), recall (R), and mean average precision at a 0.5 IOU threshold (mAP_0.5). These metrics reflect different aspects of the model’s detection capabilities and complement each other, creating a comprehensive evaluation system for the model’s overall performance.

4.3. Datasets

The dataset of this study originates from the 2023 transmission line refined inspection project of a power supply bureau in Yunnan Province. The original dataset contains 23,371 drone-captured transmission line images covering critical power equipment such as insulators, fittings, conductors, and vibration dampers. To ensure data quality, we selected images containing cotter pins and their defects (loosening and missing) from the original dataset, excluding blurred samples or those with ambiguous defect categories. A total of 4218 valid images were retained. The annotations were performed in YOLO format using the Python v3.8.0-based open-source tool LabelImg v1.8.6 under the guidance of professional power engineers. The annotated targets include three categories of cotter pin defects: normal, loosening, and missing. Due to the small proportion of loose and missing cotter pins in regular transmission lines, the sample distribution of defective cotter pins is extremely imbalanced compared to normal ones. To address this issue, we applied data augmentation techniques such as flipping, tilting, and adding noise to the images containing defective cotter pins. The final dataset contains 9108 images, which are split into training, validation, and test sets at a ratio of 6:2:2.

The cotter pin is already small in size. The resolution in standard power inspection images is often high due to the need to record most electrical components in detail. This high-resolution processing further reduces the relative size of the cotter pin in the image, undoubtedly increasing the difficulty of precisely detecting cotter pin defects. Inspired by SAHI [52], we adopted specific image-processing techniques to overcome this challenge during model training. Specifically, we divided the high-resolution inspection images into several smaller sub-images. As shown in Figure 7, we set the cropped sub-image size to 640 × 640 pixels to fit the model’s input. Additionally, we set a 20% overlap for the cropped sections. As shown in Figure 8a, the relative size of all target boxes is less than 20% of the image area, with the majority being less than 5%. Therefore, the 20% overlap ensures that the targets are not incomplete due to the cropping. On the other hand, we directly excluded sub-images with no targets in the training set to improve training efficiency.

The number of sub-images for each image in the cropping operation is shown in Equation (15).

N = N_{w} + N_{h}

(15)

Here,

N_{w}

and

N_{h}

represent the number of sub-images in the horizontal and vertical directions of the image, respectively.

W

and

H

are the width and height of the image in pixels, while

P_{w}

and

P_{h}

denote the pixel width and height of the sub-images.

α

and

β

are the overlap ratios of the sub-images in the horizontal and vertical directions, respectively, and

⌊*⌋

denotes the floor function.

N_{w} = ⌊\frac{W}{P_{w} + α P_{w}}⌋ + 1

(16)

N_{h} = ⌊\frac{H}{P_{h} + β P_{h}}⌋ + 1

(17)

After the image slicing, the number of sub-images in the training set, validation set, and test set are 12,977, 4217, and 4203, respectively. Figure 8b provides a visual analysis of the cotter pin target sizes in the sliced dataset. A comparison shows that, although the number of small targets remains the majority in the dataset, their relative size has increased significantly after slicing. This is beneficial for detecting cotter pin defects.

We conducted training and testing on the dataset before and after slicing, and the results are shown in Table 1. All performance metrics showed significant improvements, with precision and recall rates increasing by 6.6% and 14.3%, respectively. The AP for the three categories increased by 15.1%, 5.7%, and 16.4%, respectively, indicating significant performance improvements across different categories. Meanwhile, the overall mAP_0.5 metric increased by 13.6%, further confirming the general effectiveness of the slicing method in improving dataset performance. In a broader performance evaluation range, the mAP_0.5:0.95 metric increased by 10.9%. This result demonstrates the effectiveness of the slicing method in improving model performance and further proves the stability and reliability of this approach under different thresholds.

4.4. Ablation Study

4.4.1. Ablation Study on Model Improvements

We designed a series of ablation experiments to further investigate the effectiveness of the proposed model improvements and the independent contribution of each module to the model’s performance. These experiments gradually integrated the P2 detection head, P-C2f module, MCA module, and the use of the WIOU loss function, followed by quantitative evaluation and comparative analysis of the model’s performance. In all experiments, we used a unified training strategy and consistent hyperparameter settings to ensure the academic rigor and reproducibility of the results. The experimental results (as shown in Table 2) demonstrate the positive impact of each improvement on the model’s performance.

After integrating the small-object detection head (P2), the model’s precision increased by 1.3%, and the AP for detecting cotter pin looseness and cotter pin missing defects improved by 2.8% and 0.2%, respectively. The mAP_0.5 increased by 0.5%. After introducing the P-C2f module, the model’s mAP_0.5 metric significantly improved by 1.14 percentage points. Remarkably, for detecting cotter pin looseness and missing defects, the AP improved by 1.2% and 2.5%, respectively. This result strongly supports the effectiveness of the polarization self-attention mechanism in enhancing the model’s feature representation ability.

Subsequently, after further integrating the MCA module based on the introduction of the P-C2f module, the model’s performance received another significant improvement. The mAP_0.5 metric increased by 0.8 percentage points. In particular, for the detection tasks of cotter pin looseness and missing defects, precision increased by 1.8% and 0.6%, respectively. This result indicates that introducing the MCA module enhanced the model’s ability to effectively handle complex scenarios and capture multi-dimensional contextual information.

Finally, when the WIOU loss function was used during training, the model’s performance reached its optimal level. The mAP_0.5 metric reached 66.3%, representing an improvement of three percentage points compared to the baseline network. Additionally, other key performance metrics also showed significant improvements. This result not only validates the effectiveness of the WIOU loss function in optimizing the model training process, but also comprehensively demonstrates the significant impact of the proposed model improvement strategy on enhancing overall model performance.

The ablation study data, shown in Table 2, systematically validate the optimization of safety-critical metrics. The final model (OURS) achieved 5.9% and 4.7% AP gains for loosening and missing defects, respectively, demonstrating enhanced control over high-risk false negatives (FN). While the overall recall increased modestly from 58.1% to 58.8%, precision improved by 2.0%, confirming no false positive (FP) trade-off. Key module contributions include: WIOU loss alone boosted loosening AP by 4.2%; MCA further increased missing AP by 1.8%; and P2 + P-C2f integration improved missing AP by 4.7% over the baseline. These results, strictly derived from experimental data, validate the method’s reliability in power safety-critical scenarios.

After introducing the P2 small-object detection head to the baseline YOLOv8, the parameter count decreased by 0.8M, while the computational complexity increased by 19.4G FLOPs. This seemingly counterintuitive result arises from YOLOv8’s official design adjustment to the detection head channel count, balancing resolution and channel width to optimize performance (detailed explanation in Ultralytics’ GitHub discussion). The newly added MCA and P-C2f modules increased parameters by only 0.00003 M and 0.16 M, respectively, with FLOPs increments of 0.1 G and 0.7 G, while the WIOU loss function introduced no additional computational overhead as it involves no structural modifications. Ultimately, PMW-YOLOv8 reduced parameters by 0.64 M and increased FLOPs by 20.2 G compared to the baseline, achieving a balanced optimization of accuracy and efficiency.

4.4.2. Model Performance with Different Attention Modules

To fully validate the effectiveness and superiority of the MCA module in model optimization, we selected several classic attention modules, such as GAM [53], CA [54], CBAM [55], and ECA [56]. We conducted comparative experiments with the MCA module. We obtained the outcomes shown in Table 3 by comparing the detection results. It is evident from the data in the table that optimal performance was achieved across key performance metrics, such as accuracy and mAP_0.5 in the model incorporating the MCA module. Specifically, compared to the baseline model, the introduction of the MCA module led to a significant improvement of 0.3% in precision. Further analysis revealed that, in tasks targeting specific defect categories (such as cotter pin loosening and cotter pin missing), the inclusion of the MCA module resulted in significant performance improvements. Specifically, the AP for cotter pin loosening defects improved by 3.8%, and the AP for cotter pin missing defects increased by 1.4%. This indicates that the MCA module has strong optimization capabilities for subtle and important detection tasks. Moreover, the overall mAP_0.5 of the model was also improved by 1.5%, and this comprehensive performance boost fully validates the core value and excellent effect of the MCA module in model optimization. This series of improvements fully demonstrates the effectiveness of the MCA module in model optimization.

4.4.3. Validation of the Effectiveness of the MCA Module Insertion Position

To deeply explore the rationale behind the optimal insertion positions of the MCA module in the model, we carefully designed five sets of control experiments (see Table 4), labeled as groups a, b, c, d, and e, each representing a different application strategy for the MCA module. Among them, group d is the solution ultimately adopted in this study. In Table 4, numbers 1, 2, 3, and 4 correspond to the four potential MCA module insertion points marked in Figure 4, while the “√” symbol indicates that the MCA module is introduced at that position. Specifically, the solution adopted in this paper, group e, integrates the MCA module at all four positions shown in Figure 4. On the one hand, the enhanced features are directly fed into the corresponding detection modules for precise classification and localization predictions on multi-resolution feature maps; on the other hand, these features undergo further downsampling and are fused with deeper features, which are then processed by the deeper detection heads to perform the detection tasks, thereby achieving comprehensive feature enhancement for the detection head and completing the overall object detection task. The experimental results are summarized in Table 5.

Through comparative analysis, we observe that, compared to groups a, b, and c, the model using the d solution achieved significant advantages across several key performance metrics. These metrics include overall accuracy, mAP_0.5 for cotter pin loosening and missing defects, overall mAP_0.5, and the more comprehensive mAP_0.5:0.95 range evaluation. Group d demonstrated optimal performance in all of these areas. This confirms the rationality of the selected MCA module insertion positions in the model and further demonstrates the effectiveness of this solution in enhancing detection accuracy and robustness.

4.4.4. A Comparison of the Performance of Different IOU Loss Functions in a Model

To comprehensively validate the effectiveness and superiority of the WIOU loss function in optimizing YOLOv8 for object detection, we conducted comparative experiments by applying classical bounding box loss functions—GIOU [57], DIOU [58], CIOU, Inner-IOU [59], and NWD [60]—to the YOLOv8 model. As shown in Table 6, the YOLOv8 model with WIOU achieved optimal performance across all critical metrics: Precision-Recall Trade-off: WIOU attained a precision (P) of 78.3% and recall (R) of 57.9%, with a 0.2% improvement in precision over the CIOU baseline. Defect-Specific Performance: For the “loosening” category (e.g., cotter pin loosening), WIOU achieved an AP of 45.6%, surpassing CIOU by 4.2% and significantly outperforming NWD (43.4%). For the “missing” category (e.g., cotter pin missing), WIOU reached an AP of 59.9%, exceeding CIOU by 1% and NWD (59.1%).

Overall Detection Capability: The model achieved an mAP0.5 of 64.63%, representing a 1.3% improvement over the CIOU baseline and surpassing all other IOU variants. The mAP_0.5:0.95 reached 38.7%, demonstrating state-of-the-art performance. These results validate that WIOU enhances the model’s performance in cotter pin defect detection tasks by refining gradient descent strategies during training, particularly addressing challenges in industrial defect recognition.

4.4.5. Comparison of Heatmaps Before and After Model Improvement

We generated heatmaps for several power inspection images containing cotter pin targets using both the PMW-YOLOv8 model and the YOLOv8 model, as shown in Figure 9. These heatmaps visually show the models’ focus areas during the detection process. The improved network focuses more on the targets compared to the YOLOv8 model, with less attention given to irrelevant areas, resulting in a significant improvement in the detection of cotter pin targets.

4.5. Comparison Experiment

To comprehensively evaluate the performance of the PMW-YOLOv8 algorithm in cotter pin defect detection tasks, we selected a range of classic and currently advanced object detection algorithms, including, but not limited to, the classic YOLO [61] series networks, and two-stage algorithm Faster-RCNN [24], among others.

The model is evaluated by comparing the precision of each category and the overall mean average precision. The experimental results are shown in Table 7. The results show that compared with the YOLOv5 series models, Faster-RCNN [24] model, YOLOX model, YOLOv7 [62] model, and the latest YOLO series models such as YOLOv10 [63] and YOLO11, PMW-YOLOv8 shows relatively better performance in terms of precision (P), recall (R), mAP_0.5, and other indicators for different types of cotter pin defects. Specifically, PMW-YOLOv8 achieves the best performance among various detection models in P, R and AP for loosening and missing cotter pin defects, as well as overall mAP_0.5 and mAP_0.5~0.95.

It is worth mentioning that in the key indicator, mAP_0.5, PMW-YOLOv8 outperforms the original YOLOv8 model by 3.0%, and even when compared with YOLOv10, the best-performing model among all comparison models, it still achieves a 1.6% improvement. Moreover, due to the addition of extra attention modules, our model sacrifices some detection speed, but compared to two-stage models, our model still shows certain advantages in detection performance. Therefore, the proposed PMW-YOLOv8 model achieves good detection performance in the cotter pin defect dataset, achieving a mAP_0.5 of 66.3%, which is the highest among all compared models, and a precision of 80.1%, creating favorable conditions for intelligent inspection of cotter pin defects in power line patrols.

Through the analysis of metrics across different models, we observe the following: YOLO11 achieves the highest recall (60.8%) but a lower precision (76.4%) compared to YOLOv5s (precision: 79.6%, recall: 55.6%). This indicates that YOLO11m prioritizes detecting potential defects (e.g., “missing” with AP: 59.5%) through relaxed confidence thresholds or broader feature extraction, while YOLOv5s emphasizes reducing false positives (e.g., “normal” class AP: 87.3%) via stricter localization strategies. Tph-YOLOv5 sacrifices both precision (78.5%) and recall (54.9%) to prioritize real-time performance (FPS: 42.2), reflecting the inherent speed-accuracy trade-off in lightweight designs. Our method balances precision (78.1%), recall (58.1%), and speed (FPS: 74.3) by introducing a dedicated detection head for small objects and optimizing feature extraction and enhancement mechanisms.

4.6. Detection Results

In addition, we present the detection performance of our proposed model alongside other models in the following figures. We selected representative images to compare the detection results of different models. Since the cotter pin targets are relatively small, [original size display might fail to clearly show the cotter pin targets], therefore we only display the regions containing cotter pin targets from high-resolution images for presentation, as shown in Figure 10 and Figure 11 (ground truth). Meanwhile, since the cotter pin targets are relatively small, we use different colors for the detection boxes in the upcoming result displays to avoid label text overlapping with the targets. Specifically, the red boxes represent the normal (normal) cotter pin targets detected by the model, the green boxes represent the loosening (loosening) cotter pin targets, and the blue boxes represent the missing (missing) cotter pin targets.

From the detection results, we can observe the following: In Figure 10, the background is relatively simple and monotonous, containing 8 normal cotter pin targets. Our PMW-YOLOv8 model, like most models, correctly detected all the targets. In contrast, the Faster-RCNN model not only produced multiple false detections but also misclassified one normal cotter pin target as a loosening target.

In Figure 11, the complex structure of the transmission tower and the interwoven vegetation form a complex background, particularly the metal structure of the transmission tower, whose color is very close to that of the cotter pin, making detection more challenging. Compared to the scene in Figure 10, the targets in Figure 11 are relatively smaller, further increasing the difficulty of detection. This scene contains 14 cotter pin targets, including 13 regular targets and 1 missing target. Despite these challenges, our model performed well in this scenario, accurately detecting 13 out of the 14 normal cotter pin targets, with no misclassifications. Other models, such as YOLOX, Faster-RCNN, and RT-DETR, experienced missed detections, while models like YOLOv7, YOLOv8, and YOLO11 exhibited false detections.

Due to the relatively rare occurrence of cotter pin loosening and missing defects, there are typically only one or a few targets in each inspection image, which makes comparisons based on individual images less reliable. To improve the accuracy of the experiment, we selected multiple images containing scenes of cotter pin loosening and missing defects. We combined the detection results from each model by slicing and stitching the images. This increases the sample size, enabling a more comprehensive comparison of the detection performance of each model.

Figure 12 shows four scenes of cotter pin loosening, each scene containing one loosening target and two regular cotter pin targets. Our PMW-YOLOv8 model correctly detected all the samples. In contrast, other models did not perform as well. For instance, TPH-YOLOv5, YOLOv10, and YOLO11 exhibited false detections, misclassifying the loosening cotter pin as a regular cotter pin. In the second scene, TPH-YOLOv5, YOLOX, and YOLOv8 also misclassified the loosening cotter pin as a regular cotter pin, while YOLOv7 misclassified a regular bolt as a missing cotter pin. In the third scene, models such as YOLOv5, YOLOv8, and YOLOv10 all misclassified the loosening cotter pin as a regular cotter pin, while TPH-YOLOv5 and YOLOv7 misclassified a regular bolt as a missing cotter pin. In the fourth scene, YOLOv5m and TPH-YOLOv5 similarly exhibited false detections.

Figure 13 presents four detection scenes of cotter pin missing defects, including four missing cotter pin targets and nine normal cotter pin targets. Compared to other models, our PMW-YOLOv8 model achieved the best detection performance, correctly identifying all targets in the four scenes and demonstrating excellent recall and accuracy in detecting cotter pin defects. In the first scene, the Faster-RCNN model missed one missing cotter pin target. In the second scene, the YOLOv5s and TPH-YOLOv5 models missed one missing cotter pin target, while YOLOv7 and Faster-RCNN misclassified one missing cotter pin target as a normal cotter pin target. In the third scene, models such as YOLOv5, YOLOv8, YOLOX, YOLOv10, and YOLO11 missed one normal cotter pin target, while YOLOv7 misclassified one normal cotter pin target as a missing cotter pin target. In the fourth scene, YOLOv5, YOLOX, YOLOv10, and RT-DETR models missed one normal cotter pin target.

To demonstrate the generalization capability of the PMW-YOLOv8 model, we selected an image where both YOLOv8 and PMW-YOLOv8 produced identical detection results under normal weather conditions, as shown in Figure 14. Both models detected seven cotter pin targets in clear weather. By simulating scenarios with snow, rain, thunderstorms, and dense fog, we compared the changes in detection results between the two models. Under simulated rain and thunderstorm conditions, both models exhibited identical detection results, successfully identifying all cotter pins. However, in snowy conditions, YOLOv8 detected six cotter pins with one missed detection, while PMW-YOLOv8 detected all targets. In dense fog conditions, both models experienced missed detections: YOLOv8 detected only three cotter pins, whereas PMW-YOLOv8 detected five. These results demonstrate the superior generalization performance of PMW-YOLOv8 under diverse weather conditions.

5. Conclusions

This paper addresses the minor object detection issue faced by existing cotter pin detection methods by segmenting high-resolution images into several sub-images during the model training phase. This increases the relative size of cotter pin targets in the images, improving detection accuracy. In the PMW-YOLOv8 model, a small-object detection head was added, which fully utilizes the edge information and shallow features of small objects to enhance detection precision.

Additionally, we integrated an MCA module into each of the four target detection heads (P2–P5), further enhancing the extracted features. This integration improves the model’s ability to detect and classify small objects. To address the fine-grained issues in cotter pin detection, we introduced a polarized self-attention mechanism (PSA) on top of the C2f module, proposing a P-C2f module. This module is incorporated into the model to strengthen the extraction and processing of fine-grained information, thereby enhancing detection accuracy for fine-grained cotter pin targets.

The proposed PMW-YOLOv8 was tested on the cotter pin defect dataset, and the experimental results show that PMW-YOLOv8 outperforms the native YOLOv8 algorithm in key metrics such as AP and mAP_0.5 when detecting common defects such as loose and missing cotter pins. Moreover, compared to other classic detection algorithms, PMW-YOLOv8 achieves satisfactory results in precision, AP, mAP_0.5, and mAP_0.5:0.95.

Overall, PMW-YOLOv8 demonstrates better detection performance and effectively improves the detection of cotter pin defects in power transmission line inspections.

Although the proposed PMW-YOLOv8 demonstrates improved performance in cotter pin defect detection for power transmission lines, several challenges remain to be addressed in future research:

(1) Sensitivity to adverse weather conditions: Current methods may suffer from performance degradation due to image quality deterioration caused by rain, snow, or fog. To enhance model robustness in harsh weather scenarios, we will explore weather-invariant feature learning through generative adversarial training and physics-based image restoration techniques.

(2) Edge deployment constraints: The increased model complexity (parameters and computations) for achieving high detection accuracy could limit real-time processing on edge devices such as UAVs and mobile platforms. Future work will focus on lightweight network architecture design and model compression techniques (e.g., pruning and quantization) while maintaining competitive performance, thereby enabling efficient deployment on resource-constrained devices.

Author Contributions

Conceptualization, P.W., Z.Z. and J.R.; Methodology, P.W., Z.Z. and J.R.; Resources, Y.M.; Software, P.W., Z.Z. and J.R.; Supervision, H.Z.; Validation, J.R.; Writing—original draft, P.W., Z.Z. and J.R.; Writing—review and editing, H.Z. and G.Y. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Yunnan Province Special Fund for Key Programs in Science and Technology, China, grant number 202202AD080004 and the Natural Science Foundation of China (Grant Nos. 12263008).

Data Availability Statement

Data were obtained from Yunnan Power Grid Co., Ltd. and are available from the authors with the permission of Yunnan Power Grid Co., Ltd.

Acknowledgments

We would like to extend our sincere gratitude to Yi Ma for his support throughout the project. Yi Ma not only guided the design and implementation of the experiments, but also ensured that all necessary resources were appropriately allocated, laying a solid foundation for the success of this research. Additionally, his review and final approval of the manuscript were critical and greatly enhanced its quality. We express our deepest appreciation to Yi Ma.

Conflicts of Interest

The author Yi Ma was employed by the company Yunnan Power Grid Co., Ltd. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

References

Liao, K.; Lau, J.; Hidayat, M. An Innovative Aircraft Skin Damage Assessment Using You Only Look Once-Version9: A Real-Time Material Evaluation System for Remote Inspection. Aerospace 2025, 12, 31. [Google Scholar] [CrossRef]
Gong, Y.; Zhou, W.; Wang, K.; Wang, J.; Wang, R.; Deng, H.; Liu, G. Defect detection of small cotter pins in electric power transmission system from UAV images using deep learning techniques. Electr. Eng. 2023, 105, 1251–1266. [Google Scholar]
Yi, Y.; Zhou, Z.; Wu, Y. Defect Detection of Transmission Line Bolts Based on Improved YOLOv5. In Proceedings of the 2023 3rd International Conference on Electronic Information Engineering and Computer Science (EIECS), Changchun, China, 22–24 September 2023; pp. 355–359. [Google Scholar]
Li, Y.; Liu, M.; Li, Z.; Jiang, X. CSSAdet: Real-Time end-to-end small object detection for power transmission line inspection. IEEE Trans. Power Deliv. 2023, 38, 4432–4442. [Google Scholar] [CrossRef]
Li, Y.; Li, Z.; Liu, Y.; Sheng, G.; Jiang, X. Pin bolt state identification using cascaded object detection networks. Front. Energy Res. 2022, 10, 813945. [Google Scholar]
Gao, F.; Zhang, R.; Tang, J.; Liu, S.; Li, W.; Yu, J.; Chen, C.; Zheng, H. Pin-CasNet: Detecting pin status in transmission lines based on cascade network. Eng. Appl. Artif. Intell. 2024, 127, 107244. [Google Scholar]
Lin, T.Y.; Dollar, P.; Girshick, R.; He, K.; Hariharan, B.; Belongie, S. Feature Pyramid Networks for Object Detection. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 936–944. [Google Scholar] [CrossRef]
Zhao, Y.; Lv, W.; Xu, S.; Wei, J.; Wang, G.; Dang, Q.; Liu, Y.; Chen, J. Detrs beat yolos on real-time object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 17–21 June 2024; pp. 16965–16974. [Google Scholar]
Lin, T.; Maire, M.; Belongie, S.; Hays, J.; Perona, P.; Ramanan, D.; Dollár, P.; Zitnick, C.L. Microsoft coco: Common Objects in Context. In Proceedings of the 13th European Conference, Zurich, Switzerland, 6–12 September 2014; pp. 740–755. [Google Scholar]
Wei, L.; Dragomir, A.; Dumitru, E.; Christian, S.; Scott, R.; Cheng-Yang, F.; Berg, A.C. SSD: Single Shot MultiBox Detector; Springer: Cham, Switzerland, 2016. [Google Scholar]
Ding, Y.; Yuan, G.; Zhou, H.; Wu, H. ESF-DETR: A real-time and high-precision detection model for cigarette appearance. J. Real Time Image Process. 2025, 22, 54. [Google Scholar] [CrossRef]
Bell, S.; Zitnick, C.L.; Bala, K.; Girshick, R. Inside-Outside Net: Detecting Objects in Context with Skip Pooling and Recurrent Neural Networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 2874–2883. [Google Scholar]
Simard, P.Y.; Steinkraus, D.; Platt, J.C. Best Practices for Convolutional Neural Networks Applied to Visual Document Analysis. In Proceedings of the 7th International Conference on Document Analysis and Recognition, Edinburgh, UK, 3–6 August 2003; pp. 958–962. [Google Scholar]
Yaeger, L.; Lyon, R.; Webb, B. Effective Training of a Neural Network Character Classifier for Word Recognition. In Proceedings of the 10th International Conference on Neural Information Processing Systems (NIPS’96), Denver, CO, USA, 3–5 December 1996; MIT Press: Cambridge, MA, USA, 1996; pp. 807–813. [Google Scholar]
Kisantal, M. Augmentation for Small Object Detection. arXiv 2019, arXiv:1902.07296. [Google Scholar]
Yu, X.; Gong, Y.; Jiang, N.; Ye, Q.; Han, Z. Scale Match for Tiny Person Detection. In Proceedings of the 2020 IEEE Winter Conference on Applications of Computer Vision (WACV), Snowmass, CO, USA, 1–5 March 2020; pp. 1246–1254. [Google Scholar] [CrossRef]
Bochkovskiy, A.; Wang, C.; Liao, H.M. Yolov4: Optimal Speed and Accuracy of Object Detection. arXiv 2020, arXiv:2004.10934. [Google Scholar]
Liu, S.; Huang, D.; Wang, Y. Learning Spatial Fusion for Single-Shot Object Detection. arXiv 2019, arXiv:1911.09516. [Google Scholar]
Zhang, K.; Zhang, Z.; Li, Z.; Qiao, Y. Joint face detection and alignment using multitask cascaded convolutional networks. IEEE Signal Process. Lett. 2016, 23, 1499–1503. [Google Scholar] [CrossRef]
Cai, Z.; Fan, Q.; Feris, R.S.; Vasconcelos, N. A Unified Multi-Scale Deep Convolutional Neural Network for Fast Object Detection. In Proceedings of the European Conference on Computer Vision, Amsterdam, The Netherlands, 11–14 October 2016; pp. 354–370. [Google Scholar]
Wei, X.; Song, Y.; Mac Aodha, O.; Wu, J.; Peng, Y.; Tang, J.; Yang, J.; Belongie, S. Fine-grained image analysis with deep learning: A survey. IEEE Trans. Pattern Anal. Mach. Intell. 2021, 44, 8927–8948. [Google Scholar] [CrossRef] [PubMed]
Girshick, R.; Donahue, J.; Darrell, T.; Malik, J. Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June 2014; pp. 580–587. [Google Scholar]
Long, J.; Shelhamer, E.; Darrell, T. Fully Convolutional Networks for Semantic Segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, (CVPR), Boston, MA, USA, 7–12 June 2015; pp. 3431–3440. [Google Scholar]
Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards real-time object detection with region proposal networks. IEEE Trans. Pattern Anal. Mach. Intell. 2016, 39, 1137–1149. [Google Scholar]
Zhang, H.; Xu, T.; Elhoseiny, M.; Huang, X.; Zhang, S.; Elgammal, A.; Metaxas, D. Spda-cnn: Unifying Semantic Part Detection and Abstraction for Fine-Grained Recognition. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 1143–1152. [Google Scholar]
Zhang, N.; Donahue, J.; Girshick, R.; Darrell, T. Part-based R-CNNs for Fine-Grained Category Detection. In Proceedings of the 13th European Conference, Zurich, Switzerland, 6–12 September 2014; pp. 834–849. [Google Scholar]
Lin, D.; Shen, X.; Lu, C.; Jia, J. Deep lac: Deep Localization, Alignment and Classification for Fine-Grained Recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 1666–1674. [Google Scholar]
Branson, S.; Van Horn, G.; Belongie, S.; Perona, P. Bird species categorization using pose normalized deep convolutional nets. arXiv 2014, arXiv:1406.2952. [Google Scholar]
Zhang, Y.; Wei, X.; Wu, J.; Cai, J.; Lu, J.; Nguyen, V.; Do, M.N. Weakly supervised fine-grained categorization with part-based image representation. IEEE Trans. Image Process. 2016, 25, 1713–1725. [Google Scholar]
Ge, W.; Lin, X.; Yu, Y. Weakly supervised complementary parts models for fine-grained image classification from the bottom up. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; pp. 3034–3043. [Google Scholar]
Peng, Y.; He, X.; Zhao, J. Object-part attention model for fine-grained image classification. IEEE Trans. Image Process. 2017, 27, 1487–1500. [Google Scholar]
Wang, Z.; Wang, S.; Li, H.; Dou, Z.; Li, J. Graph-Propagation Based Correlation Learning for Weakly Supervised Fine-Grained Image Classification. In Proceedings of the AAAI conference on artificial intelligence, New York, NY, USA, 7–12 February 2020; pp. 12289–12296. [Google Scholar]
Liu, C.; Xie, H.; Zha, Z.; Ma, L.; Yu, L.; Zhang, Y. Filtration and distillation: Enhancing Region Attention for Fine-Grained Visual Categorization. In Proceedings of the AAAI Conference on Artificial Intelligence; Association for the Advancement of Artificial Intelligence, New York, NY, USA, 7–12 February 2020; pp. 11555–11562. [Google Scholar]
Ding, Y.; Zhou, Y.; Zhu, Y.; Ye, Q.; Jiao, J. Selective sparse sampling for fine-grained image recognition. In Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27–28 October 2019; pp. 6599–6608. [Google Scholar]
Wang, Y.; Morariu, V.I.; Davis, L.S. Learning a discriminative filter bank within a CNN for fine-grained recognition. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–23 June 2018; pp. 4148–4157. [Google Scholar]
Huang, Z.; Li, Y. Interpretable and Accurate Fine-Grained Recognition via Region Grouping. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 8662–8672. [Google Scholar]
Zheng, H.; Fu, J.; Zha, Z.; Luo, J. Looking for the Devil in the Details: Learning Trilinear Attention Sampling Network for Fine-Grained Image Recognition. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; pp. 5012–5021. [Google Scholar]
Ji, R.; Wen, L.; Zhang, L.; Du, D.; Wu, Y.; Zhao, C.; Liu, X.; Huang, F. Attention Convolutional Binary Neural Tree for Fine-Grained Visual Categorization. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 10468–10477. [Google Scholar]
Li, P.; Xie, J.; Wang, Q.; Zuo, W. Is Second-Order Information Helpful for Large-Scale Visual Recognition? In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2070–2078. [Google Scholar]
Engin, M.; Wang, L.; Zhou, L.; Liu, X. Deepkspd: Learning Kernel-Matrix-Based SPD Representation for Fine-Grained Image Recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–22 June 2018; pp. 612–627. [Google Scholar]
Dubey, A.; Gupta, O.; Raskar, R.; Naik, N. Maximum-entropy fine grained classification. In Proceedings of the Advances in Neural Information Processing Systems 31 (NeurIPS 2018), Montréal, QC, Canada, 3–8 December 2018. [Google Scholar]
Gao, Y.; Han, X.; Wang, X.; Huang, W.; Scott, M. Channel Interaction Networks for Fine-Grained Image Categorization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 14–19 June 2020; pp. 10818–10825. [Google Scholar]
Deng, J.; Krause, J.; Stark, M.; Fei-Fei, L. Leveraging the wisdom of the crowd for fine-grained recognition. IEEE Trans. Pattern Anal. Mach. Intell. 2015, 38, 666–676. [Google Scholar] [CrossRef] [PubMed]
Reed, S.; Akata, Z.; Lee, H.; Schiele, B. Learning Deep Representations of Fine-Grained Visual Descriptions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 26 June–1 July 2016; pp. 49–58. [Google Scholar]
He, X.; Peng, Y. Fine-grained image classification via combining vision and language. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2017), Honolulu, HI, USA, 21–26 July 2017; pp. 5994–6002. [Google Scholar]
Song, K.; Wei, X.; Shu, X.; Song, R.; Lu, J. Bi-modal progressive mask attention for fine-grained recognition. IEEE Trans. Image Process. 2020, 29, 7006–7018. [Google Scholar]
Cui, Y.; Zhou, F.; Lin, Y.; Belongie, S. Fine-grained categorization and dataset bootstrapping using deep metric learning with humans in the loop. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2016), Las Vegas, NV, USA, 27–30 June 2016; pp. 1153–1162. [Google Scholar]
Zanzotto, F.M. Human-in-the-loop artificial intelligence. J. Artif. Intell. Res. 2019, 64, 243–252. [Google Scholar]
Yu, Y.; Zhang, Y.; Cheng, Z.; Song, Z.; Tang, C. MCA: Multidimensional collaborative attention in deep convolutional neural networks for image recognition. Eng. Appl. Artif. Intell. 2023, 126, 107079. [Google Scholar] [CrossRef]
Liu, H.; Liu, F.; Fan, X.; Huang, D. Polarized self-attention: Towards High-Quality Pixel-Wise Regression. arXiv 2021, arXiv:2107.00782. [Google Scholar]
Tong, Z.; Chen, Y.; Xu, Z.; Yu, R. Wise-IOU: Bounding Box Regression Loss with Dynamic Focusing Mechanism. arXiv 2023, arXiv:2301.10051. [Google Scholar]
Akyon, F.C.; Altinuc, S.O.; Temizel, A. Slicing aided hyper inference and fine-tuning for small object detection. In Proceedings of the 2022 IEEE International Conference on Image Processing (ICIP), Bordeaux, France, 18–21 September 2022; pp. 966–970. [Google Scholar]
Liu, Y.; Shao, Z.; Hoffmann, N. Global Attention Mechanism: Retain Information to Enhance Channel-Spatial Interactions. arXiv 2021, arXiv:2112.05561. [Google Scholar]
Hou, Q.; Zhou, D.; Feng, J. Coordinate Attention for Efficient Mobile Network Design. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 19–25 June 2021; pp. 13708–13717. [Google Scholar]
Woo, S.; Park, J.; Lee, J.Y.; Kweon, I.S. CBAM: Convolutional Block Attention Module; Springer: Cham, Switzerland, 2018. [Google Scholar]
Wang, Q.; Wu, B.; Zhu, P.; Li, P.; Hu, Q. ECA-Net: Efficient Channel Attention for Deep Convolutional Neural Networks. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 11531–11539. [Google Scholar]
Rezatofighi, H.; Tsoi, N.; Gwak, J.Y.; Sadeghian, A.; Savarese, S. Generalized Intersection Over Union: A Metric and a Loss for Bounding Box Regression. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019. [Google Scholar]
Zheng, Z.; Wang, P.; Liu, W.; Li, J.; Ye, R.; Ren, D. Distance-IOU Loss: Faster and Better Learning for Bounding Box Regression. arXiv 2019, arXiv:1911.08287. [Google Scholar]
Zhang, H.; Xu, C.; Zhang, S. Inner-IOU: More Effective Intersection over Union Loss with Auxiliary Bounding Box. arXiv 2023, arXiv:2311.02877. [Google Scholar]
Wang, J.; Xu, C.; Yang, W.; Yu, L. A Normalized Gaussian Wasserstein Distance for Tiny Object Detection. arXiv 2021, arXiv:2110.13389. [Google Scholar]
Redmon, J. You only look once: Unified, Real-Time Object Detection. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016. [Google Scholar]
Wang, C.; Bochkovskiy, A.; Liao, H.M. YOLOv7: Trainable Bag-of-Freebies Sets New State-of-the-Art for Real-Time Object Detectors. In Proceedings of the 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada; 2023; pp. 7464–7475. [Google Scholar]
Wang, A.; Chen, H.; Liu, L.; Chen, K.; Lin, Z.; Han, J.; Ding, G. Yolov10: Real-time end-to-end object detection. arXiv 2024, arXiv:2405.14458. [Google Scholar]
Zhu, X.; Lyu, S.; Wang, X.; Zhao, Q. TPH-YOLOv5: Improved YOLOv5 Based on Transformer Prediction Head for Object Detection on Drone-Captured Scenarios. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) Workshops, Montreal, QC, Canada, 11–17 October 2021; pp. 2778–2788. [Google Scholar]
Zheng, G.; Songtao, L.; Feng, W.; Zeming, L.; Jian, S. YOLOX: Exceeding YOLO series in 2021. arXiv 2021, arXiv:2107.08430. [Google Scholar]

Figure 1. An inspection image containing two cotter pins. Red bounding boxes identify cotter pin targets.

Figure 2. Different cotter pin states.

Figure 3. Research approach in this paper.

Figure 4. PMW-YOLOv8 network model structure. The blue dashed boxes denote the newly added P2 small-object detection head and its associated feature pyramid module. The red solid boxes represent the improved P-C2f module integrated with polarized self-attention (PSA). The purple solid boxes indicate the multi-scale channel attention (MCA) modules embedded before the four detection heads (P2–P5). Black arrows illustrate the multi-scale feature fusion paths from the backbone network to the detection heads. The color coding of modules is consistent with the descriptions in Section 3 of the main text.

Figure 5. Schematic architecture of the improved P-C2f with polarized self-attention (PSA). (a) Proposed P-C2f module; (b) PSA-Bottleneck module: the PSA module is inserted after the second convolutional layer in the Bottleneck module, enhancing fine-grained feature extraction; (c) structure of the PSA module.

Figure 6. Integration of multi-scale channel attention (MCA) into the model. (a) MCA modules embedded before the detection heads; (b) architectural details of the MCA module.

Figure 7. Sub-image splitting training. (a) Illustration of overlapping sliding-window cropping; (b) Demonstration of target-retained sub-images containing cotter pins.

Figure 8. Comparison of target sizes before and after dataset slicing. (a) Statistics of relative image dimensions before sliced cropping. (b) Histogram of relative image dimensions before sliced cropping.

Figure 9. Comparison of detection result heatmaps. the red boxes represent the normal cotter pin targets detected by the model, the green boxes represent the loosening cotter pin targets.

Figure 10. Detection results in simple background images.

Figure 11. Detection results in complex background images.

Figure 12. Detection results of cotter pin loosening targets in different backgrounds.

Figure 13. Detection results of cotter pin missing targets in different backgrounds.

Figure 14. PMM-YOLOv8 vs YOLOv8: detection results under different weather conditions.

Table 1. Results before and after slicing.

Method	Precision (P, %)	Recall (R, %)	AP (%)			mAP_0.5 (%)	mAP_0.5:0.95 (%)
Method	Precision (P, %)	Recall (R, %)	Normal	Loosening	Missing	mAP_0.5 (%)	mAP_0.5:0.95 (%)
YOLOv8	71.5	43.8	73.4	35.7	42.5	50.7	26.2
+Slice	78.1	58.1	88.5	41.4	58.9	63.3	37.1

Table 2. Ablation study results on the cotter pin defect dataset.

Method	P2	P-C2f	MCA	WIOU	Precision (P)	Recall (R)	AP (%)			mAP_0.5 (%)	Params	GFLOPs
Method	P2	P-C2f	MCA	WIOU	Precision (P)	Recall (R)	Normal	Loosening	Missing	mAP_0.5 (%)	Params	GFLOPs
YOLOv8					78.1	58.1	88.5	41.4	58.9	63.3	25.858057	79.1
	√				79.4	57.5	88.2	44.2	59.1	63.8	25.05262	98.5
			√		78.4	56.8	88.3	45.2	60.3	64.8	25.858087	79.1
				√	78.3	57.9	88.1	45.6	59.9	64.6	25.858057	79.1
	√	√			79.4	57.0	88.2	45.4	61.6	65.0	25.211936	99.2
	√	√	√		79.4	58.3	88.1	47.2	62.2	65.8	25.211976	99.3
OURS	√	√	√	√	80.1	58.8	87.3	47.3	63.7	66.3	25.211976	99.3

Table 3. Model performance with different attention modules.

Method	Precision (P)	Recall (R)	AP (%)			mAP_0.5 (%)	mAP_0.5:0.95 (%)
Method	Precision (P)	Recall (R)	Normal	Loosening	Missing	mAP_0.5 (%)	mAP_0.5:0.95 (%)
Baseline/YOLOv8	78.1	58.1	88.5	41.4	58.9	63.3	37.1
GAM	77.9	58.4	88.3	41.3	60.3	63.3	37.7
CA	76.8	60.3	88.7	43.7	60.2	64.3	37.8
CBAM	78.3	56.2	87.5	44.0	58.3	63.2	38.2
ECA	78.2	57.6	87.7	44.6	58.8	63.7	38.3
SE	76.3	56.5	87.2	43.9	59.1	63.5	37.6
+MCA	78.4	56.8	88.3	45.2	60.3	64.8	38.8

Table 4. Different embedding strategies of the MCA module.

Method	1	2	3	4
a	√
b				√
c			√	√
d		√	√	√
e (our)	√	√	√	√

1, 2, 3, and 4 represent the MCA modules in Figure 4.

Table 5. Ablation experiment on the insertion position of the MCA module.

Method	Precision (P)	Recall (R)	AP (%)			mAP_0.5 (%)	mAP_0.5:0.95 (%)
Method	Precision (P)	Recall (R)	Normal	Loosening	Missing	mAP_0.5 (%)	mAP_0.5:0.95 (%)
Baseline/YOLOv8	78.1	58.1	88.5	41.4	58.9	63.3	37.1
a	77.9	58.0	87.4	44.4	61.3	64.4	38.8
b	76.3	57.2	88.2	44.3	60.2	64.0	38.2
c	77.6	56.1	86.9	44.6	60.9	64.2	38.5
d	78.2	56.4	86.4	41.7	58.1	63.3	37.2
e (our)	78.4	56.8	88.3	45.2	60.3	64.8	38.8

Table 6. Comparison of the performance of different IOU loss functions in a model.

Method	Precision (P)	Recall (R)	AP (%)			mAP_0.5 (%)	mAP_0.5:0.95 (%)
Method	Precision (P)	Recall (R)	Normal	Loosening	Missing	mAP_0.5 (%)	mAP_0.5:0.95 (%)
YOLOv8 (CIOU)	78.1	58.1	88.5	41.4	58.9	63.3	37.1
GIOU	71.5	58.2	88.7	40.7	54.3	61.2	35.9
DIOU	73.7	56.2	88.3	38.3	52.0	59.6	35.1
Inner-IOU	73.9	56.4	87.8	38.3	54.8	60.3	35.5
NWD	78.0	56.7	88.1	43.4	59.1	63.5	38.0
+WIOU	78.3	57.9	88.1	45.6	59.9	64.6	38.7

Table 7. Experimental results of different models on the cotter pin defect dataset.

Method	Precision (P)%	Recall (R)%	AP (%)			mAP_0.5 (%)	mAP_0.5:0.95 (%)	FPS
Method	Precision (P)%	Recall (R)%	Normal	Loosening	Missing	mAP_0.5 (%)	mAP_0.5:0.95 (%)	FPS
Baseline/YOLOv8	78.1	58.1	88.5	41.4	58.9	63.3	37.1	74.3
YOLOv5s	79.6	55.6	87.3	44.0	55.6	62.3	36.9	85.0
YOLOv5m	79.3	58.2	87.8	45.6	61.1	63.1	39.2	66.2
tph-YOLOv5 [64]	78.5	54.9	86.6	37.9	49.5	58.0	36.6	42.2
YOLOv7 [62]	63.3	53.1	84.1	39.5	41.2	54.9	30.2	55.6
YOLOv10m [63]	80.1	55.2	88.2	45.9	60.1	64.7	39.4	49.8
YOLO11m	76.4	60.8	89.2	44.8	59.5	64.6	39.0	46.9
YOLOX [65]	58.0	53.8	85.9	35.5	51.7	56.8	32.9	44.4
Faster-RCNN [24]	47.9	44.8	75.2	31.8	60.6	58.6	33.3	35.8
RT-DETR-L [8]	76.0	53.5	82.3	31.3	43.8	52.4	28.2	53.9
RetinaNet	76.2	55.4	82.6	41.8	61.4	61.9	35.7	43.1
OURS	80.1	58.8	87.3	47.3	63.7	66.3	40.3	45.9

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Wang, P.; Yuan, G.; Zhang, Z.; Rao, J.; Ma, Y.; Zhou, H. Detection of Cotter Pin Defects in Transmission Lines Based on Improved YOLOv8. Electronics 2025, 14, 1360. https://doi.org/10.3390/electronics14071360

AMA Style

Wang P, Yuan G, Zhang Z, Rao J, Ma Y, Zhou H. Detection of Cotter Pin Defects in Transmission Lines Based on Improved YOLOv8. Electronics. 2025; 14(7):1360. https://doi.org/10.3390/electronics14071360

Chicago/Turabian Style

Wang, Peng, Guowu Yuan, Zhiqin Zhang, Junlin Rao, Yi Ma, and Hao Zhou. 2025. "Detection of Cotter Pin Defects in Transmission Lines Based on Improved YOLOv8" Electronics 14, no. 7: 1360. https://doi.org/10.3390/electronics14071360

APA Style

Wang, P., Yuan, G., Zhang, Z., Rao, J., Ma, Y., & Zhou, H. (2025). Detection of Cotter Pin Defects in Transmission Lines Based on Improved YOLOv8. Electronics, 14(7), 1360. https://doi.org/10.3390/electronics14071360

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers. See further details here.

Article Menu

Detection of Cotter Pin Defects in Transmission Lines Based on Improved YOLOv8

Abstract

1. Introduction

2. Related Work

2.1. Small Object Detection

2.2. Fine-Grained Object Detection

3. PMW-YOLOv8 Network Model Structure

3.1. A Small Target Detection Layer

3.2. P-C2f: Integrating PSA into C2f

3.3. Embedding MCA into the Model

3.4. WIOU Loss

4. Analysis of Experimental Results

4.1. Experimental Environment

4.2. Evaluation Metrics

4.3. Datasets

4.4. Ablation Study

4.4.1. Ablation Study on Model Improvements

4.4.2. Model Performance with Different Attention Modules

4.4.3. Validation of the Effectiveness of the MCA Module Insertion Position

4.4.4. A Comparison of the Performance of Different IOU Loss Functions in a Model

4.4.5. Comparison of Heatmaps Before and After Model Improvement

4.5. Comparison Experiment

4.6. Detection Results

5. Conclusions

Author Contributions

Funding

Data Availability Statement

Acknowledgments

Conflicts of Interest

References

Share and Cite

Article Metrics

Article Access Statistics

Further Information

Guidelines

MDPI Initiatives

Follow MDPI