Article

PGE-YOLO: A Multi-Fault-Detection Method for Transmission Lines Based on Cross-Scale Feature Fusion

1 School of Computer Science and Technology, Xinjiang University, Urumqi 830000, China
2 State Grid XinJiang Electric Power Co., Ltd., Urumqi 830000, China
* Author to whom correspondence should be addressed.
Electronics 2024, 13(14), 2738; https://doi.org/10.3390/electronics13142738
Submission received: 1 June 2024 / Revised: 27 June 2024 / Accepted: 10 July 2024 / Published: 12 July 2024

Abstract
Addressing the incorrect and missed detections caused by the complex types, uneven scales, and small sizes of defect targets in transmission lines, this paper proposes a defect-detection method based on cross-scale feature fusion, PGE-YOLO. First, feature extraction is enriched by replacing the convolutional blocks in the backbone network that participate in cascade fusion with the Par_C2f module, which incorporates a parallel network (ParNet). Second, a four-layer efficient multi-scale attention (EMA) mechanism is incorporated into the network’s neck to address long- and short-term dependency issues, improving global information retention through parallel substructures and the integration of cross-space feature information. Finally, the paradigm of generalized feature fusion (GFPN) is introduced and reconfigured as a novel CE-GFPN, which effectively integrates shallow feature information with deep feature information to strengthen feature fusion and improve detection performance. Ablation and comparison experiments with various models on a real multi-defect transmission line dataset from UAV aerial photography and on the CPLID dataset demonstrated that our model achieves superior results. Compared to the initial YOLOv8n model, our model increased the detection accuracy by 6.6% and 1.2% on the two datasets, respectively, without a surge in the number of parameters. This satisfies the industry’s real-time and accuracy requirements for defect detection.

1. Introduction

As an important medium for transmitting electric energy, transmission lines play a decisive role between power plants and users. Intact transmission lines are crucial for the advancement of society and the overall well-being of people’s livelihoods. However, transmission lines are typically located and constructed in exposed outdoor environments, facing complex and changing natural conditions. The numerous types of equipment and devices can easily develop unexpected defects, thus affecting the entire power system. Therefore, defect detection of transmission lines is a top priority for grid units. Historically, transmission line defect detection has mostly relied on manual methods, where human observers confirm the existence of defects. This approach obviously poses security risks and consumes significant human resources. With the development of UAV technology, transmission line detection has shifted towards automatic detection using UAV equipment [1]. However, this method still requires substantial image data transmission to a central cloud, where human observers are used to confirm the defects. While this reduces safety risks, it consumes considerable time and energy. In recent years, the continuous development of deep learning has led many researchers to shift their focus from traditional defect detection to utilizing deep learning-based techniques [2] for end-to-end automatic detection. The whole process of transmission line defect detection based on deep learning is shown in Figure 1.
Generally, defect detection based on deep learning can be divided into two mainstream categories: one-stage detection and two-stage detection, both of which are widely used for detecting defects in transmission lines. Zhao et al. [3] designed a defect-detection method for transmission line insulators based on an improved Faster RCNN [4] model, utilizing a Feature Pyramid Network (FPN) [5] to enhance the network structure and locate targets in complex backgrounds. Liu et al. [6] utilized Mask RCNN [7] for transmission equipment recognition, using RoI Align in place of RoI Pool to retain more geometric location information and improve detection accuracy. Dong et al. [8] proposed an enhanced Cascade RCNN [9] algorithm for detecting key components of transmission lines, introducing the Swin-V2 [10] module and the Balanced Feature Pyramid (BFP) [11] structure to improve model accuracy. Although the aforementioned two-stage algorithms yield favorable results in detecting transmission line defects, they come at the cost of a significant reduction in detection speed compared to one-stage algorithms, along with increased complexity. This trade-off is considered unacceptable by industry standards. Therefore, one-stage defect-detection models, represented by the You Only Look Once (YOLO) [12] family, are more popular. Liu et al. [13] utilized an enhanced YOLOv4 [14] model for detecting defects in transmission lines, incorporating a fourth layer of detection heads to improve the model’s capability to detect small targets. Song et al. [15] developed an improved YOLOv5 model to detect transmission line component defects and introduced a CBAM [16] attention module to improve detection accuracy by capturing important features and suppressing unimportant ones. Zheng et al. [17] enhanced the YOLOv7 [18] model for detecting insulator defects by introducing the SIoU [19] and focal loss functions to expedite the model’s convergence and address the imbalance between positive and negative samples, achieving favorable outcomes. Wu et al. [20] presented an improved YOLOv8 model for detecting damage in overhead transmission lines, incorporating a lightweight GSConv [21] convolution module and a Slim Neck network structure to effectively reduce the complexity and computational load of the model while maintaining high performance. Undeniably, these studies provide significant assistance in the detection of transmission line defects. However, these methods struggle in more complex environments with wider ranges of defect scales and very small defect sizes. Specifically, for the tilting problem of the grading ring, the inconsistency in the tilt angle necessitates increased attention to spatial scale variations. For the pin substitution issue, the subtle difference in scale between normal pins and defective substitution pins requires fine-grained feature extraction. Additionally, when dealing with missing gaskets, which occupy only single-digit pixel counts relative to the whole image, the deep network must retain key feature information as much as possible through feature fusion. These areas require further research.
Therefore, it is crucial to investigate a method for detecting multiple defects in transmission lines while achieving accurate results for targets with significant scale differences and small sizes. In response to these challenges, this paper proposes an efficient defect detection model based on cross-scale feature fusion designed using YOLOv8. The main contributions of this study are as follows:
  • We designed an improved backbone module based on ParNet [25], which forms a parallel branch structure through convolutions of different sizes. We leveraged the advantages of the ParNet design to strengthen the weak representation of the original YOLOv8 backbone. This module was added to the residual sub-module DarknetBottle of C2f, creating a new backbone network module called Par_C2f. By utilizing this structure without additional depth, we can extract features and increase the receptive field, resulting in significant improvements in model performance.
  • A four-layer architecture multi-scale attention module was strategically designed at the network’s neck. This involves reshaping a portion of the channels to the batch dimension and grouping the channel dimensions within each feature group. The aim is to achieve a uniform distribution of spatial semantic features in each feature group while also mitigating the potential side effects of channel dimension reduction during deep feature representation extraction. Furthermore, the internal network design of the EMA’s three branches enables cross-space feature fusion. Additionally, the four-layer architecture is designed so that three of the layers directly output to the detection head, allowing the model to focus on key target information.
  • The neck feature fusion approach was reconfigured, replacing the original feature pyramid structure PAN with the generalized feature fusion paradigm (GFPN). In this study, we replaced the original ordinary connection mode with a dense link. Furthermore, C2f and the EMA were integrated into the generalized feature fusion paradigm to create a new internal fusion block. Ultimately, the design of cross-scale feature fusion enables better preservation of feature information in the shallow network.

2. Materials and Methods

YOLOv8, as the state-of-the-art (SOTA) model of the YOLO family, is highly favored and frequently cited by numerous researchers for defect-detection tasks. Wang et al. [22] conducted a road defect detection task using YOLOv8 and demonstrated its effectiveness compared to other models. Cao et al. [23] enhanced YOLOv8 for the purposes of detecting defects in solar PV modules, ultimately leading to a significant improvement in the accuracy of PV defect detection. The widespread use of YOLOv8 for detection tasks in the industry can be attributed to a series of breakthroughs compared to previous models in the YOLO series. Firstly, the backbone network module C2f was redesigned to construct a richer feature representation while handling the vanishing gradient problem well by concatenating multiple residually connected DarknetBottle blocks. Secondly, YOLOv8 adopts an anchor-free design and introduces the Task Aligned Assigner sample assignment strategy [24] to simplify the model design and training process. Then, a decoupling head structure was designed for YOLOv8 to process the classification task and regression task separately, which reduces the number of network parameters and improves detection efficiency. All these facts also became the reason why we chose YOLOv8 as the original model.
This model retains the triple backbone–neck–head architecture. First, we propose the improved backbone, which redesigns the C2f modules that require cross-scale fusion as the novel Par_C2f through a parallel-subnetwork design pattern, enhancing the feature extraction ability. In the neck, a four-layer multi-scale attention module is introduced, with three layers fully utilized to effectively increase the quality of feature extraction before the output of the three detection heads. Finally, we introduce the idea of generalized feature fusion (GFPN) and propose a new paradigm, the CE-GFPN, by combining the C2f module and the EMA module, which reconstructs the original PAN feature pyramid structure of YOLOv8. Based on these improvement strategies, we name this model PGE-YOLO. The specific details of the model are introduced in this section, and the improved network structure is shown in Figure 2.

2.1. Improved Backbone

After observing and analyzing the defective data, we identified a subtle difference in scale between the images of normally installed pins and violently substituted pins captured by drone photographs. Consequently, it is necessary to enhance the model’s capability to extract critical information from the backbone network.
In YOLOv8, the C2f module is designed for the backbone network; it obtains rich feature extraction by concatenating N DarknetBottle blocks built on residual connections. While this residual linking approach yields superior results compared to previous models in the YOLO family, it is challenging to effectively extract features for subtle gaps using only the simple CBS block (Conv2d + BatchNorm2d + SiLU). This limitation often leads to missed detections. Conversely, if complex, super-deep modules are designed into the backbone network, the model offers powerful representation ability at the cost of greater training time and resource consumption; in severe cases, this can even cause vanishing or exploding gradients. Therefore, in this paper, we combined the non-deep network ParNet [25] with the C2f module to form the efficient Par_C2f network block and applied it directly in the cross-scale layers of the backbone network. Specifically, we reconstructed the design of the residual block DarknetBottle in C2f and added a new ParNet block inside the residual block without further expanding the network depth. This approach enriches subtle feature extraction without significantly increasing parameter and computational consumption, as shown in Figure 3.
A detailed depiction of the ParNet structure is shown in Figure 4. The approach utilizes multiple convolutional blocks of different sizes to form a parallel branching structure, which is then subjected to a batch normalization operation, and finally provides the outputs through a single 3 × 3 convolutional block. Such a reparameterization structure is well designed to reduce the delay caused by the inference process. Secondly, to address the problem of the limited receptive field caused by non-depth blocks, a skip–squeeze–excitation block (SSE) based on the squeeze–excitation (SE) [26] method was constructed in the branch structure. In this module, an output that underwent the global average pooling layer was connected in parallel via a skip connection, compensating for the performance deficits caused by the lack of depth. Finally, we define the residual block fused with ParNet as ParNetBottle and demonstrate the method’s effectiveness through experiments.
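To make the structure concrete, the following is a minimal PyTorch sketch of the SSE branch and a ParNet-style block as described above. The exact channel widths, branch composition, and the way the block is embedded in DarknetBottle follow Figures 3 and 4, so the module names here (SSE, ParNetBlock) are illustrative rather than the authors’ reference code.

```python
import torch
import torch.nn as nn

class SSE(nn.Module):
    """Skip-squeeze-excitation sketch: a global-average-pooled channel gate
    applied to a skip connection, compensating for the shallow receptive field."""
    def __init__(self, channels):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)          # global average pooling
        self.fc = nn.Conv2d(channels, channels, 1)   # 1x1 conv producing gate logits

    def forward(self, x):
        return x * torch.sigmoid(self.fc(self.pool(x)))

class ParNetBlock(nn.Module):
    """Parallel 1x1 and 3x3 convolution branches plus the SSE branch,
    fused by summation and a SiLU activation, with no added depth."""
    def __init__(self, channels):
        super().__init__()
        self.branch1 = nn.Sequential(
            nn.Conv2d(channels, channels, 1, bias=False), nn.BatchNorm2d(channels))
        self.branch3 = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1, bias=False), nn.BatchNorm2d(channels))
        self.sse = SSE(channels)
        self.act = nn.SiLU()

    def forward(self, x):
        return self.act(self.branch1(x) + self.branch3(x) + self.sse(x))
```

In Par_C2f, such a block sits inside the DarknetBottle residual path, so the receptive field grows through parallel branches rather than extra depth.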

2.2. Four-Layer Architecture Multi-Scale Attention Module

In general, the defects on transmission lines are primarily small targets that are difficult to observe. When the original image is preprocessed by the model, it is first scaled to a low-resolution sample of uniform size (640 × 640). At this time, the original small target can only occupy single-digit pixels. These defective targets provide limited information for the entire image, making it extremely challenging for the model to extract key features. Therefore, our model must enhance its capacity for feature representation. It is well known that the emergence of the attention mechanism has had a profound impact on object detection, allowing the model to dynamically focus on the key information in the image. However, there are issues to be addressed with popular channel attention and spatial attention mechanisms. On the one hand, the channel attention mechanism requires additional computational cost to calculate the attention weight of each channel, increasing the computational complexity and memory consumption of the model. Moreover, the performance of the channel attention mechanism largely depends on the quality of the initial feature representation. If the initial feature representation is inadequate, the channel attention mechanism may not effectively improve model performance. On the other hand, the spatial attention mechanism may lose the spatial information of the image, especially when the attention weights are focused primarily on localized regions. This could result in a decline in model performance when handling tasks with a spatial structure. Additionally, the spatial attention mechanism may become overly reliant on local information, thus ignoring important features in other locations in the image, leading to a lack of understanding of global relationships and affecting performance. In this paper, we introduce the efficient multi-scale attention module (EMA) [27] to avoid these problems. Moreover, our four-layer architecture EMA model effectively addresses the challenges associated with extracting key information from minimal targets and establishes a robust long- and short-term dependence relationship from a global perspective.
As shown in Figure 5, feature grouping is one of the EMA’s distinguishing features. For any input feature map $X \in \mathbb{R}^{C \times H \times W}$, $X$ is divided into $g$ sub-features along the channel direction, denoted by $X = [X_0, X_1, \ldots, X_{g-1}]$, where $X_i \in \mathbb{R}^{C//g \times H \times W}$. In this way, the feature extraction of each sub-feature $X_i$ is represented by the learned weights for the key information. Next, we analyze the network structure of the EMA. The EMA draws inspiration from Coordinate Attention (CA) [28] and non-deep networks and is specifically designed to incorporate three parallel sub-structure groups, allowing the dynamic assignment of weights to the grouped feature maps. Two of the parallel groups undergo a 1 × 1 convolution following a 1D global average pooling operation along the two respective directions, which we refer to as the 1 × 1 branch. The third group directly undergoes a 3 × 3 convolution, referred to as the 3 × 3 branch. The 1D global average pooling is defined as follows: $z_c^H$ for pooling along the horizontal dimension and $z_c^W$ for pooling along the vertical dimension.
$$ z_c^H(H) = \frac{1}{W} \sum_{0 \le i \le W} x_c(H, i) $$
$$ z_c^W(W) = \frac{1}{H} \sum_{0 \le j \le H} x_c(j, W) $$
After reshaping the grouped sub-features into the batch dimension (giving tensors of shape $C//g \times H \times W$), the features entering the 1 × 1 branch are combined through concatenation, thus avoiding any reduction in dimensionality at the channel level. The resulting output then undergoes sigmoid activation along both directions, and the two parallel paths are combined through multiplication to achieve the initial cross-channel interaction of features. In the 3 × 3 branch, the 3 × 3 convolution preserves local key information within the channel and expands the feature space. Finally, the three branch groups perform overall cross-space feature information fusion. Specifically, the global spatial information is encoded by a 2D global average pooling operation on one output of the 1 × 1 branch and matrix-multiplied with the output of the 3 × 3 branch to form the first spatial attention, denoted as $R_1^{C//g \times 1 \times 1} \times R_3^{C//g \times H \times W}$, where $R_1$ is the output of the 1 × 1 branch and $R_3$ is the output of the 3 × 3 branch. The 2D global average pooling is defined as follows.
$$ z_c = \frac{1}{H \times W} \sum_{j=1}^{H} \sum_{i=1}^{W} x_c(i, j) $$
Similarly, the output of the 3 × 3 branch is matrix-multiplied with the output of the 1 × 1 branch after 2D global average pooling to obtain the second spatial attention, expressed as $R_3^{C//g \times 1 \times 1} \times R_1^{C//g \times H \times W}$. Finally, the two spatial attention weight maps are aggregated to obtain the global key information and address the feature loss caused by long- and short-term dependencies. We placed the EMA module at the neck of the network and constructed a mechanism that jointly enhances four layers of attention; the attention from the last three layers outputs directly to the detection heads, strengthening the focus on target position information and improving overall detection performance. Our experiments demonstrated that this method yields favorable results. Additionally, we conducted multiple sets of comparison experiments with other mainstream attention mechanisms to verify the overall effectiveness of the EMA for transmission line defect detection, as depicted in Figure 6.
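The PyTorch sketch below traces these steps: channel grouping into the batch dimension, the two directionally pooled 1 × 1 paths, the 3 × 3 branch, and the two cross-space attentions. It follows the published EMA design [27], but details such as the group count and the GroupNorm placement are illustrative assumptions.

```python
import torch
import torch.nn as nn

class EMA(nn.Module):
    """Efficient multi-scale attention sketch: grouped channels, two directionally
    pooled 1x1 paths, a 3x3 path, and cross-space fusion of the two outputs."""
    def __init__(self, channels, groups=8):
        super().__init__()
        self.g = groups
        self.softmax = nn.Softmax(dim=-1)
        self.agp = nn.AdaptiveAvgPool2d(1)              # 2D global average pooling
        self.pool_h = nn.AdaptiveAvgPool2d((None, 1))   # 1D pooling along the width
        self.pool_w = nn.AdaptiveAvgPool2d((1, None))   # 1D pooling along the height
        self.gn = nn.GroupNorm(channels // groups, channels // groups)
        self.conv1x1 = nn.Conv2d(channels // groups, channels // groups, 1)
        self.conv3x3 = nn.Conv2d(channels // groups, channels // groups, 3, padding=1)

    def forward(self, x):
        b, c, h, w = x.shape
        gx = x.reshape(b * self.g, c // self.g, h, w)   # fold groups into the batch dim
        # 1x1 branch: pool along each direction, share one 1x1 conv, gate by sigmoid
        xh = self.pool_h(gx)                            # (b*g, c/g, h, 1)
        xw = self.pool_w(gx).permute(0, 1, 3, 2)        # (b*g, c/g, w, 1)
        hw = self.conv1x1(torch.cat([xh, xw], dim=2))   # concat avoids channel reduction
        xh, xw = torch.split(hw, [h, w], dim=2)
        x1 = self.gn(gx * xh.sigmoid() * xw.permute(0, 1, 3, 2).sigmoid())
        x2 = self.conv3x3(gx)                           # 3x3 branch keeps local context
        # cross-space fusion: each branch's pooled descriptor attends over the other
        a1 = self.softmax(self.agp(x1).reshape(b * self.g, 1, -1))  # (b*g, 1, c/g)
        a2 = self.softmax(self.agp(x2).reshape(b * self.g, 1, -1))
        y1 = torch.matmul(a1, x2.reshape(b * self.g, c // self.g, -1))  # (b*g, 1, h*w)
        y2 = torch.matmul(a2, x1.reshape(b * self.g, c // self.g, -1))
        weights = (y1 + y2).reshape(b * self.g, 1, h, w).sigmoid()
        return (gx * weights).reshape(b, c, h, w)
```

In PGE-YOLO, four such modules sit in the neck, with the outputs of the last three feeding the detection heads directly.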

2.3. CE-GFPN

Research has shown that models with over one hundred layers often lose shallow network information. However, these shallow layers typically contain crucial local spatial information, particularly the relative positional relationships of different parts within an image. This information is essential for our research direction: in transmission lines, various defects, including grading rings and insulators, manifest as spatially shifted defects. In particular, for the skew problem of the grading ring, the distinguishing features between normal and defective samples cannot be extracted from the appearance scale; the gap in relative spatial position can only be extracted from a global perspective. However, the original YOLOv8 only utilizes the top-down and bottom-up feature pyramid structure (PAN) [29], which does not effectively address the extraction and fusion of shallow features. Similarly, Tan et al. [30] proposed the Bi-directional Feature Pyramid Network (BiFPN), which achieves feature fusion by removing nodes with only one input edge and adding skip connections. This approach has shown improved performance compared to the traditional PAN; however, removing nodes may discard useful information that cannot be preserved at the same level. The challenge of integrating shallow feature information into a slightly deeper network, as well as fusing deeper feature information into a very deep network, remains unresolved. To address this issue, we adopt the design concept of generalized feature fusion (GFPN) [31] and incorporate the C2f module and attention into our reconstructed paradigm, CE-GFPN. After multiple rounds of cross-scale feature fusion, shallow feature extraction is enhanced and greater attention is paid to global information, significantly improving the model’s performance. The detailed feature fusion mode is shown in Figure 7.
In CE-GFPN, skip connections are utilized. However, to mitigate the issue of gradient vanishing caused by dense links, the dense links are designed as sparse links based on $\log_2 n$. This approach not only preserves and transfers information effectively, but also facilitates better fusion of shallow and deep features. Specifically, this can be expressed as follows:
$$ P_k^l = \mathrm{Conv}\!\left(\mathrm{Concat}\!\left(P_k^{\,l-2^n}, \ldots, P_k^{\,l-2^1}, P_k^{\,l-2^0}\right)\right) $$
Here, $P_k^l$ represents the feature map of level $k$ and layer $l$; Concat represents the cascade fusion of features; and Conv represents the 3 × 3 convolution. Additionally, in our CE-GFPN, the inner fusion block is restructured. Specifically, as shown in Figure 8, we define each cascade operation as a fusion block; within each fusion block, C2f performs preliminary feature extraction after the three-layer feature cascade, and the attention mechanism then performs pixel-level feature extraction to retain key information before the output passes to the next cascade block.
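As a concrete illustration, the sketch below pairs the $\log_2 n$ link rule with a CE-GFPN-style fusion block. The helper name and the block composition (Concat → C2f → EMA, per Figure 8) are our reading of the paper, with C2f assumed importable from the YOLOv8 codebase and EMA as sketched in Section 2.2.

```python
import torch
import torch.nn as nn
from ultralytics.nn.modules import C2f  # YOLOv8's C2f block

def log2n_links(l):
    """Indices of the earlier layers feeding layer l under log2(n) sparse
    skip links: l - 2^0, l - 2^1, ..., while the index stays non-negative."""
    links, step = [], 1
    while l - step >= 0:
        links.append(l - step)
        step *= 2
    return links  # e.g. layer 5 receives layers [4, 3, 1]

class FusionBlock(nn.Module):
    """CE-GFPN fusion block sketch: concatenate the linked feature maps,
    extract with C2f, then refine with EMA attention (Section 2.2)."""
    def __init__(self, in_channels, out_channels):
        super().__init__()
        self.c2f = C2f(in_channels, out_channels)
        self.ema = EMA(out_channels)  # EMA as sketched earlier

    def forward(self, feats):
        # feats: list of same-resolution maps selected by log2n_links
        return self.ema(self.c2f(torch.cat(feats, dim=1)))
```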

3. Experiments and Results

In this section, we will first introduce the dataset used in our experiments and the data augmentation methods. Next, we will present the software and hardware setup of the experiment and propose the evaluation index for transmission line defect detection. Subsequently, we will conduct ablation experiments to demonstrate the effectiveness of our proposed method, followed by comparative experiments with other mainstream detection models. Finally, we will conduct an experimental test of the model’s generalization. Detailed descriptions will be provided in the following sections.

3.1. Dataset

Due to the limited availability of public datasets on multi-target defects in transmission lines, we collected data for this experiment using unmanned aerial vehicles (UAVs) in Xinjiang Uygur Autonomous Region. A total of 1450 high-quality images depicting insulator breakage, grading ring tilt, gasket missing, and pin replacement were carefully selected and analyzed. Each image contained multiple label classes. Upon labeling, it was found that there were 328 instances of grading rings, 236 cases of damaged insulators, 1572 occurrences of normal insulators, 259 instances of missing gaskets, and 487 cases of pin replacement, totaling 2882 labels. We have named the dataset as TLMDD. For the experiment, we divided the data into 1014 training samples, 219 validation samples, and 217 testing samples with a ratio of 0.7:0.15:0.15 based on the dataset size. Additionally, to assess the model’s generalization, we utilized the insulator defect dataset CPLID, which is publicly available from the State Grid Corporation of China.
To enhance the model’s generalization and mitigate the risk of overfitting, we implemented an online data augmentation strategy during training. This involves the following steps: First, the image undergoes preliminary transformation using conventional methods such as translation, scaling, and flipping. Then, four pairs of images are randomly cropped and restitched through mosaic data augmentation. This combined data augmentation approach enriches the diversity and comprehensiveness of the dataset samples. The effect of the data augmentation strategy is illustrated in Figure 9.
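For reference, these steps map onto ultralytics-style augmentation hyperparameters roughly as below; the numeric values are illustrative defaults, not settings reported in the paper.

```python
# Hedged mapping of the described online augmentation onto ultralytics-style
# hyperparameters; the values are illustrative, not taken from the paper.
augmentation = dict(
    translate=0.1,  # random translation, as a fraction of image size
    scale=0.5,      # random scaling gain
    fliplr=0.5,     # probability of a horizontal flip
    mosaic=1.0,     # probability of stitching four randomly cropped images
)
```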

3.2. Experiment Settings and Parameter Settings

The experimental hardware and software environments utilized in this paper are shown in Table 1.
The parameters of the experiments performed in this paper are shown in Table 2.
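Assuming the ultralytics training interface, the Table 2 settings translate into a call like the following; the model and dataset YAML paths are placeholders, not files released with the paper.

```python
from ultralytics import YOLO

model = YOLO("pge-yolo.yaml")  # placeholder for the modified model definition
model.train(
    data="tlmdd.yaml",         # placeholder dataset config
    imgsz=640,                 # input resolution used in this paper
    epochs=300,
    batch=32,
    workers=8,
    optimizer="SGD",
    lr0=0.01,
    momentum=0.973,
)
```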

3.3. Evaluation Metrics

In this study, the evaluation metrics include the average precision (AP), recall (R), and precision (P) for each class, as well as the mean average precision (mAP) over all classes. The average precision (AP) measures the model’s performance in each category, taking into account the imbalance between categories. The mean average precision (mAP), which averages the AP across all categories, is crucial for evaluating overall performance and serves as the key metric in this study. Note that the mAP was evaluated as the mAP@50 and the mAP@50-95. Specifically, the mAP@50 is the mean average precision calculated at an Intersection over Union (IoU) threshold of 0.5, where the IoU is the intersection area of the detection box and the ground-truth box divided by their union area; when the IoU exceeds 0.5, the detection box is considered correct. In contrast, the mAP@50-95 averages the mAP over multiple IoU thresholds, ranging from 0.5 to 0.95 in increments of 0.05. Recall (R) measures the model’s ability to correctly identify positive samples and is the ratio of true positives to actual positives. Precision (P) measures how accurately the model predicts positive samples, i.e., the ratio of true positives to predicted positives. These four metrics are calculated as follows:
$$ \mathrm{AP} = \int_0^1 P(R)\,dR $$
$$ \mathrm{mAP} = \frac{\sum_{i=1}^{N} \int_0^1 P_i(R)\,dR}{N} $$
$$ R = \frac{TP}{TP + FN} \times 100\% $$
$$ P = \frac{TP}{TP + FP} \times 100\% $$
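A minimal Python sketch of the IoU computation underlying these thresholds (the box format and helper name are our own, not from the paper):

```python
def iou(box_a, box_b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

# mAP@50 counts a prediction as a true positive when iou > 0.5;
# mAP@50-95 repeats the evaluation at IoU thresholds 0.50, 0.55, ..., 0.95
# and averages the resulting mAP values.
thresholds = [0.50 + 0.05 * k for k in range(10)]
```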
In addition, we considered the number of parameters, frames per second (FPS), and GFLOPS in our comparative experiments. Implementability is crucial for industry deployment standards, and a model with a large number of parameters implies that lightweight deployment is not feasible, which is unacceptable for our purposes. Real-time object detection considers the speed of the model. FPS is a measure of the model’s ability to process image frames per second; the higher the FPS, the faster the model processes images. Similarly, in the industry, the standard for real-time target detection is generally not less than 30 FPS, and 60 FPS may be required for more demanding systems. These standards will also be used as our criteria for real-time target detection of transmission lines.

3.4. Ablation Experiments

In this paper, YOLOv8 has been improved in three distinct ways. Ablation experiments effectively evaluate the impact of different improvements and their combinations on the model. The experimental results were analyzed using the precision, recall, mAP@50, and mAP@50-95 as the evaluation metrics. Table 3 presents the precision values for each type of defect, while Table 4 shows the average indices for all defects.
From the perspective of specific defects, Table 3 demonstrates that utilizing C2f_Par to enhance the backbone network improves accuracy for all types of defects except for pins. This validates that our backbone network enhances the feature extraction capability. Furthermore, incorporating the four-layer EMA attention module resulted in a notable 4.3% enhancement in accuracy for detecting pin replacement problems with small target sizes and subtle scale changes. Additionally, there was an impressive accuracy increase of up to 8.7% for identifying grading rings, indicating that our attention mechanism effectively preserves global spatial information through cross-scale design. After reconfiguring the design of the generalized feature fusion approach, we observed maximum enhancements of 7.3% and 5.7% for the two types of spatial-type defects, insulators and grading rings, respectively. This result demonstrates that the cross-scale spatial fusion approach effectively integrates location information from shallow feature maps into the deep network.
From the comprehensive results of the model, Table 4 demonstrates that implementing three enhancements on the backbone network, neck attention, and feature fusion individually resulted in a significant improvement in the overall average accuracy of the model by 1.7%, 4.3%, and 4%, respectively. Simultaneous enhancement of the backbone network and attention mechanisms addresses the limitations in the individual accuracy improvements. Simultaneously improving the backbone network and CE-GFPN led to an average accuracy of 68.2%, surpassing the incremental improvements achieved at each stage. Ultimately, integrating all three enhancements into the initial model concurrently resulted in the highest average accuracy of 69.9%, marking a substantial increase of 6.6%. This experiment not only validates the effectiveness of each individual improvement module, but also confirms that, when the three improvements are applied simultaneously, they generate the maximum beneficial impact.

3.5. Comparative Experiment

To address our improvement points, this paper conducts comparison experiments from multiple perspectives. First, we compare multiple attention mechanisms added individually, with the results presented in Table 5. Second, we compare mainstream feature fusion methods, with the results shown in Table 6. Finally, to verify the comprehensive effectiveness of the proposed method, we conduct comparative experiments with multiple methods, including Faster RCNN, Cascade RCNN, YOLOv5n, YOLOv6n, YOLOv8n, YOLOv8s, and RT-DETR, selecting the mAP@50, mAP@50-95, parameters, GFLOPs, and FPS as evaluation metrics; the results are presented in Table 7.
Initially, we incorporated five types of attention into the original YOLOv8 model, namely the SE, CA, NAM, ECA, and EMA. These attention mechanisms encompass a range of popular channel and spatial attention techniques currently in use. As shown in Table 5, the addition of attention led to varying degrees of improvement in model accuracy. Furthermore, the proposed four-layer efficient multi-scale attention architecture resulted in the maximum improvement. It is worth noting that the inclusion of attention has a detrimental effect on the model’s recall rate. We infer that this attention may focus on non-critical areas while neglecting the target area when extracting contextual semantic information, resulting in a decreased recall rate.
Furthermore, we compared progressive feature fusion (AFPN) and bi-directional feature fusion (BiFPN). As shown in Table 6, our generalized feature fusion approach yielded the most favorable results for defect detection in transmission lines, with a 4% increase in the mAP@50 and a 1.9% increase in the recall rate.
Finally, we compared our model with other models and analyzed the results from the perspectives of accuracy and real-time performance. Table 7 shows that the two-stage detection models Faster RCNN and Cascade RCNN not only achieved lower accuracy than PGE-YOLO, but also had significantly higher parameter counts of 41.4 MB and 69.2 MB, respectively. Additionally, RT-DETR, which builds on the Transformer concept, did not yield ideal results for detecting defects in transmission lines, with an mAP of only 58.1%. In contrast, our PGE-YOLO outperformed YOLOv5n, YOLOv6n, and YOLOv8n at the same parameter level. Notably, our method’s accuracy was 2.3% higher than that of YOLOv8s while using only 27% of its parameters. Although the improved neck feature fusion mode complicates the model’s structure, PGE-YOLO’s efficiency of 81 FPS remains in the first echelon and far exceeds that of the two-stage algorithms, meeting the stricter industry standard for real-time object detection (above 60 FPS).

3.6. Generalization Experiments

The methodology of this experiment has been verified to achieve excellent results on our transmission line defect dataset. To further validate the model’s generalizability across different datasets, we will conduct experiments using the same approach on the CPLID dataset and the enhanced TLMDD.

3.6.1. CPLID

Table 8 shows that our method improved the recall, mAP@50, and mAP@50-95 by 1.9%, 1.2%, and 1.5%, respectively, compared to the initial YOLOv8 model, despite a slight decrease in precision. Compared with Faster RCNN, Cascade RCNN, and YOLOv5, all indicators improved comprehensively, indicating that our research methodology is effective and satisfies the requirements of generalization.

3.6.2. Enhanced TLMDD

Our original TLMDD dataset is relatively small, which potentially limits the model’s generalization. Therefore, in this section, we expanded the dataset to 5800 samples through offline data augmentation. The distribution of the data samples is illustrated in Figure 10. The model’s generalization is then verified through comparative experiments.
The results in Table 9 show that, on the enhanced TLMDD dataset, our method achieved improvements of 2% and 3.2% over the original model’s mAP@50 and mAP@50-95, respectively. In addition to accuracy, both the recall and precision outperformed those of the other SOTA models. Therefore, we conclude that PGE-YOLO demonstrates good generalization ability on larger datasets.

3.7. Visualization

In this section, we conduct a visual test of PGE-YOLO in real-world transmission line scenarios and different environment scenarios. Above all, we categorize three types of real-world scenes: simple scenes, scenes with minimal targets, and complex dense scenes.
As shown in Figure 11, the baseline model YOLOv8 performed well in handling simple scenes, but still fell short of PGE-YOLO. When confronted with extremely small objects, the baseline model achieved much weaker detection results than our proposed PGE-YOLO. Moreover, when faced with complex, dense scenes, the baseline model suffered serious missed detections, while PGE-YOLO maintained the integrity of detection in such cases.
To verify the robustness of the model under different conditions, we designed four weather scenarios: rainy day, dark day, strong light, and fog occlusion. As shown in Figure 12, PGE-YOLO maintained reliable detection performance in all four extreme scenarios.

4. Conclusions

We propose PGE-YOLO to address the missed and false detections caused by large scale gaps and small targets in transmission-line defect-detection tasks. First, we replaced the backbone modules requiring feature fusion with the Par_C2f module, enriching the feature extraction ability through added parallel convolution blocks. Second, to better focus on key information from smaller targets, we introduced and designed a four-layer efficient multi-scale attention architecture. We then redesigned the GFPN to integrate shallow feature information into deeper layers through cross-scale feature fusion, significantly improving spatial defect-detection accuracy. Finally, extensive ablation and comparison experiments were conducted on the TLMDD and CPLID datasets. The experimental results demonstrated that PGE-YOLO achieves superior performance with good generalization: compared to the original YOLOv8n, it improved the transmission line defect-detection accuracy by 6.6% on TLMDD and improved the recall, mAP@50, and mAP@50-95 by 1.9%, 1.2%, and 1.5%, respectively, on the CPLID dataset. PGE-YOLO effectively balances the accuracy and real-time requirements of transmission line defect-detection tasks.

Author Contributions

Conceptualization, Z.C. and T.W.; methodology, Z.C.; software, Z.C. and W.H.; validation, Z.C. and W.H.; formal analysis, Z.C. and A.D.; investigation, Z.C. and T.W.; resources, T.W.; data curation, Z.C. and A.D.; writing—original draft preparation, Z.C. and W.H.; writing—review and editing, Z.C. and T.W.; visualization, Z.C. and W.H.; supervision, A.D. and T.W.; project administration, Z.C. and T.W.; funding acquisition, T.W. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Major Science and Technology Project in Xinjiang Uygur Autonomous Region (Grant No. 2022A01007).

Data Availability Statement

CPLID dataset link: https://github.com/InsulatorData/InsulatorDataSet, accessed on 19 November 2018.

Conflicts of Interest

Author T.W. was employed by the company State Grid XinJiang Electric Power Co., Ltd. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

References

  1. Li, Z.; Zhang, Y.; Wu, H.; Suzuki, S.; Namiki, A.; Wang, W. Design and application of a UAV autonomous inspection system for high-voltage power transmission lines. Remote Sens. 2023, 15, 865. [Google Scholar] [CrossRef]
  2. Liu, X.; Miao, X.; Jiang, H.; Chen, J. Data analysis in visual power line inspection: An in-depth review of deep learning for component detection and fault diagnosis. Annu. Rev. Control 2020, 50, 253–277. [Google Scholar] [CrossRef]
  3. Zhao, W.; Xu, M.; Cheng, X.; Zhao, Z. An insulator in transmission lines recognition and fault detection model based on improved faster RCNN. IEEE Trans. Instrum. Meas. 2021, 70, 5016408. [Google Scholar] [CrossRef]
  4. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards real-time object detection with region proposal networks. IEEE Trans. Pattern Anal. Mach. Intell. 2016, 39, 1137–1149. [Google Scholar] [CrossRef] [PubMed]
  5. Lin, T.Y.; Dollár, P.; Girshick, R.; He, K.; Hariharan, B.; Belongie, S. Feature pyramid networks for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 2117–2125. [Google Scholar]
  6. Liu, Y.; Huo, H.; Fang, J.; Mai, J.; Zhang, S. UAV transmission line inspection object recognition based on mask R-CNN. In Journal of Physics: Conference Series; IOP Publishing: Bristol, UK, 2019; Volume 1345, p. 062043. [Google Scholar]
  7. He, K.; Gkioxari, G.; Dollár, P.; Girshick, R. Mask r-cnn. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2961–2969. [Google Scholar]
  8. Dong, C.; Zhang, K.; Xie, Z.; Shi, C. An improved cascade RCNN detection method for key components and defects of transmission lines. IET Gener. Transm. Distrib. 2023, 17, 4277–4292. [Google Scholar] [CrossRef]
  9. Cai, Z.; Vasconcelos, N. Cascade r-cnn: Delving into high quality object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 6154–6162. [Google Scholar]
  10. Liu, Z.; Hu, H.; Lin, Y.; Yao, Z.; Xie, Z.; Wei, Y.; Ning, J.; Cao, Y.; Zhang, Z.; Dong, L.; et al. Swin transformer v2: Scaling up capacity and resolution. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 12009–12019. [Google Scholar]
  11. Pang, J.; Chen, K.; Shi, J.; Feng, H.; Ouyang, W.; Lin, D. Libra r-cnn: Towards balanced learning for object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 821–830. [Google Scholar]
  12. Terven, J.; Cordova-Esparza, D. A comprehensive review of YOLO: From YOLOv1 to YOLOv8 and beyond. arXiv 2023, arXiv:2304.00501. [Google Scholar]
  13. Liu, Z.; Wu, G.; He, W.; Fan, F.; Ye, X. Key target and defect detection of high-voltage power transmission lines with deep learning. Int. J. Electr. Power Energy Syst. 2022, 142, 108277. [Google Scholar] [CrossRef]
  14. Bochkovskiy, A.; Wang, C.Y.; Liao, H.Y.M. YOLOv4: Optimal speed and accuracy of object detection. arXiv 2020, arXiv:2004.10934. [Google Scholar]
  15. Song, J.; Qin, X.; Lei, J.; Zhang, J.; Wang, Y.; Zeng, Y. A fault detection method for transmission line components based on synthetic dataset and improved YOLOv5. Int. J. Electr. Power Energy Syst. 2024, 157, 109852. [Google Scholar] [CrossRef]
  16. Woo, S.; Park, J.; Lee, J.Y.; Kweon, I.S. Cbam: Convolutional block attention module. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 3–19. [Google Scholar]
  17. Zheng, J.; Wu, H.; Zhang, H.; Wang, Z.; Xu, W. Insulator-defect detection algorithm based on improved YOLOv7. Sensors 2022, 22, 8801. [Google Scholar] [CrossRef] [PubMed]
  18. Wang, C.Y.; Bochkovskiy, A.; Liao, H.Y.M. YOLOv7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 7464–7475. [Google Scholar]
  19. Gevorgyan, Z. SIoU loss: More powerful learning for bounding box regression. arXiv 2022, arXiv:2205.12740. [Google Scholar]
  20. Wu, Y.; Liao, T.; Chen, F.; Zeng, H.; Ouyang, S.; Guan, J. Overhead Power Line Damage Detection: An Innovative Approach Using Enhanced YOLOv8. Electronics 2024, 13, 739. [Google Scholar] [CrossRef]
  21. Li, H.; Li, J.; Wei, H.; Liu, Z.; Zhan, Z.; Ren, Q. Slim-neck by GSConv: A better design paradigm of detector architectures for autonomous vehicles. arXiv 2022, arXiv:2206.02424. [Google Scholar]
  22. Wang, X.; Gao, H.; Jia, Z.; Li, Z. BL-YOLOv8: An improved road defect detection model based on YOLOv8. Sensors 2023, 23, 8361. [Google Scholar] [CrossRef]
  23. Cao, Y.; Pang, D.; Zhao, Q.; Yan, Y.; Jiang, Y.; Tian, C.; Wang, F.; Li, J. Improved YOLOv8-GD deep learning model for defect detection in electroluminescence images of solar photovoltaic modules. Eng. Appl. Artif. Intell. 2024, 131, 107866. [Google Scholar] [CrossRef]
  24. Feng, C.; Zhong, Y.; Gao, Y.; Scott, M.R.; Huang, W. Tood: Task-aligned one-stage object detection. In Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 10–17 October 2021; pp. 3490–3499. [Google Scholar]
  25. Goyal, A.; Bochkovskiy, A.; Deng, J.; Koltun, V. Non-deep networks. Adv. Neural Inf. Process. Syst. 2022, 35, 6789–6801. [Google Scholar]
  26. Hu, J.; Shen, L.; Sun, G. Squeeze-and-excitation networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 7132–7141. [Google Scholar]
  27. Ouyang, D.; He, S.; Zhang, G.; Luo, M.; Guo, H.; Zhan, J.; Huang, Z. Efficient multi-scale attention module with cross-spatial learning. In Proceedings of the ICASSP 2023–2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Rhodes Island, Greece, 4–10 June 2023; pp. 1–5. [Google Scholar]
  28. Hou, Q.; Zhou, D.; Feng, J. Coordinate attention for efficient mobile network design. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Virtual, 19–25 June 2021; pp. 13713–13722. [Google Scholar]
  29. Liu, S.; Qi, L.; Qin, H.; Shi, J.; Jia, J. Path aggregation network for instance segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 8759–8768. [Google Scholar]
  30. Tan, M.; Pang, R.; Le, Q.V. Efficientdet: Scalable and efficient object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 14–19 June 2020; pp. 10781–10790. [Google Scholar]
  31. Jiang, Y.; Tan, Z.; Wang, J.; Sun, X.; Lin, M.; Li, H. Giraffedet: A heavy-neck paradigm for object detection. arXiv 2022, arXiv:2202.04256. [Google Scholar]
Figure 1. The overall detection process.
Figure 2. The network structure diagram of our PGE-YOLO algorithm. Our model continues to follow the three-layer design pattern of backbone+neck+head. The SPPF module represents an enhanced version of Spatial Pyramid Pooling.
Figure 3. The structure of C2f (a) and Par_C2f (b).
Figure 4. The structure of ParNet. The SSE module represents the skip–squeeze–excitation block.
Figure 5. The structure of the EMA. After reshaping the channel into the batch dimension, three parallel substructure groups were formed, two 1 × 1 branches and one 3 × 3 branch.
Figure 6. Visualization of the findings from multiple comparative experiments. (a) refers to the comparison of attention mechanism improvement, while (b) refers to the comparison of feature fusion improvement.
Figure 7. The structure of PAN (a), BiFPN (b), and GFPN (c).
Figure 8. A graph of the independent structure of the internal fusion block.
Figure 9. The rendering after data augmentation.
Figure 10. Expanded TLMDD dataset labels’ distribution map.
Figure 11. Visualization tests in three real scenarios: simple scenes (a), minimal target scenes (b), and complex dense scenes (c). Ground Truth represents labels of real defects; Base represents detection results using YOLOv8.
Figure 12. Visualized results in different weather environments.
Table 1. Software and hardware settings.

| Item | Value |
|---|---|
| CPU | 18 vCPU AMD EPYC 9754 128-Core Processor |
| GPU | NVIDIA GeForce RTX 4090D (24 GB) |
| Python | 3.9 |
| PyTorch | 2.0 |
| CUDA | 11.8 |
Table 2. Experimental parameter settings.

| Parameter | Value |
|---|---|
| Epochs | 300 |
| Batch size | 32 |
| Workers | 8 |
| Optimizer | SGD |
| Learning rate | 0.01 |
| Momentum | 0.973 |
Table 3. AP metric results for each category.

| C2f_Par | EMA | CE-GFPN | Insulator | Insulator Defect | Ring Defect | Gasket Missing | Pin Defect |
|---|---|---|---|---|---|---|---|
| – | – | – | 91.6 | 76.4 | 53.4 | 69.8 | 25.2 |
| ✓ | – | – | 94.1 | 79.2 | 57.3 | 72.8 | 21.5 |
| – | ✓ | – | 93.7 | 80.8 | 62.1 | 72.1 | 29.5 |
| – | – | ✓ | 92.8 | 83.7 | 59.1 | 72.2 | 28.7 |
| ✓ | ✓ | – | 93.3 | 81.2 | 62.9 | 72.9 | 28.4 |
| ✓ | – | ✓ | 93.2 | 82.2 | 60.1 | 71.9 | 33.4 |
| ✓ | ✓ | ✓ | 93.0 | 82.3 | 74.5 | 73.1 | 26.3 |
Table 4. mAP metric results for all categories.

| C2f_Par | EMA | CE-GFPN | mAP@50 | mAP@50-95 |
|---|---|---|---|---|
| – | – | – | 63.3 | 38.1 |
| ✓ | – | – | 65.0 | 41.0 |
| – | ✓ | – | 67.6 | 42.9 |
| – | – | ✓ | 67.3 | 43.3 |
| ✓ | ✓ | – | 67.7 | 43.1 |
| ✓ | – | ✓ | 68.2 | 44.9 |
| ✓ | ✓ | ✓ | 69.9 | 45.7 |
Table 5. The results of the attention comparison experiment.

| Method | P | R | mAP@50 | mAP@50-95 |
|---|---|---|---|---|
| Base | 63.2 | 66.3 | 63.3 | 38.1 |
| Base + SE | 58.6 | 66.4 | 63.9 | 40.5 |
| Base + CA | 59.9 | 65.1 | 64.1 | 41.4 |
| Base + NAM | 62.5 | 63.8 | 64.8 | 42.7 |
| Base + ECA | 63.9 | 65.2 | 65.6 | 43.1 |
| Base + EMA | 67.5 | 66.6 | 67.6 | 42.9 |
Table 6. The results of the feature fusion comparison experiment.

| Method | P | R | mAP@50 | mAP@50-95 |
|---|---|---|---|---|
| Base + PAN | 63.2 | 66.3 | 63.3 | 38.1 |
| Base + AFPN | 63.5 | 66.5 | 64.6 | 45.0 |
| Base + BiFPN | 64.2 | 63.9 | 63.7 | 37.3 |
| Base + GFPN | 63.1 | 68.2 | 67.3 | 43.3 |
Table 7. Comparison experiment results between PGE-YOLO and SOTA models.

| Method | mAP@50 | mAP@50-95 | Parameters (MB) | GFLOPs | FPS |
|---|---|---|---|---|---|
| Faster-RCNN | 59.7 | 37.5 | 41.4 | 70.1 | 38 |
| Cascade-RCNN | 65.9 | 40.5 | 69.2 | 98.6 | 33 |
| YOLOv5n | 63.0 | 38.2 | 2.5 | 7.1 | 90 |
| YOLOv6n | 62.2 | 38.4 | 4.2 | 11.8 | 99 |
| YOLOv8n | 63.3 | 38.1 | 3.1 | 8.1 | 86 |
| YOLOv8s | 67.6 | 44.7 | 11.0 | 28.4 | 80 |
| RT-DETR | 58.1 | 37.9 | 32.0 | 103.5 | 31 |
| PGE-YOLO | 69.9 | 45.7 | 3.5 | 9.1 | 81 |
Table 8. Generalization experiment results on the CPLID dataset.

| Metrics | Faster RCNN | Cascade RCNN | YOLOv5 | YOLOv8 | PGE-YOLO |
|---|---|---|---|---|---|
| P | 85.5 | 93.1 | 98.5 | 98.5 | 97.1 |
| R | 86.9 | 96.9 | 96.3 | 95.8 | 97.9 |
| mAP@50 | 88.4 | 97.3 | 98.3 | 97.8 | 99.0 |
| mAP@50-95 | 59.1 | 73.7 | 79.9 | 80.0 | 81.5 |
Table 9. Generalization experiment results on the enhanced TLMDD dataset.

| Metrics | Faster RCNN | Cascade RCNN | YOLOv5 | YOLOv8 | PGE-YOLO |
|---|---|---|---|---|---|
| P | 65.7 | 71.2 | 77.9 | 79.0 | 83.9 |
| R | 55.7 | 60.2 | 78.0 | 80.0 | 81.2 |
| mAP@50 | 71.9 | 80.2 | 82.0 | 84.2 | 86.2 |
| mAP@50-95 | 42.3 | 52.0 | 57.7 | 57.8 | 61.0 |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
