1. Introduction
Transmission lines often need to cross mountains and rivers and are mainly distributed in complex terrain, such as mountainous and hilly regions with harsh climates, which significantly complicates line inspection. Insulators operate for long periods in harsh and complex environments involving strong electric fields, intense sunlight, high temperatures, and mechanical stress; once their degradation reaches a certain level, their insulation performance declines [1,2,3,4]. On high-voltage transmission lines in particular, insulator deterioration directly threatens the safe operation of power systems. To ensure the stable and safe operation of the power grid, defect detection for power insulators has become an essential task in power system surveillance. It is therefore important to study insulator image recognition and defect detection methods [5]. Before deep learning matured, image recognition technology was limited: an early insulator defect detection method used edge-feature extraction to obtain the insulator's shape, fitted the contour with an elliptic equation, and finally determined the missing parts of the insulator by analysis and counting. Another method used an image filtering algorithm to extract the insulator's edges for image recognition. These methods are computationally complex and inefficient. With the development of artificial intelligence technology, computer vision has gradually been applied to detecting defects in transmission line insulators.
Current deep learning-based target detection and recognition algorithms fall broadly into two categories. One is region proposal-based target detection algorithms, with representatives such as the region-based convolutional neural network (R-CNN) [6], Fast R-CNN [7], Faster R-CNN [8], and spatial pyramid pooling networks (SPP-Net) [9]. The other is regression-based target detection and recognition algorithms, such as the single shot multibox detector (SSD) [10] and the YOLOv2 [11], YOLOv3 [12], and YOLOv4 [13] models in the you-only-look-once (YOLO) [14] series.
Wang et al. [15] proposed a transmission line ice thickness identification method combining a MobileNet v3 lightweight feature extraction network with an SSD detection network, achieving an accuracy of 74.5%. Zhao et al. [16] constructed an automatic defect detection model, the automatic visual shape clustering network (AVSCNet), to detect missing bolt parts, with a detection accuracy of up to 87.6%. Davari et al. [17] used Faster R-CNN to detect defects on distribution lines in each frame of UV–visible video, identified corona discharges on the lines by color thresholding, and described fault severity by the ratio of spot area to defect area. Rong et al. [18] applied Faster R-CNN, the Hough transform, and advanced stereovision (SV) to detect vegetation encroachment on power transmission lines, converting two-dimensional (2D) images of vegetation and transmission lines into three-dimensional (3D) height and location results for accurate identification and localization. Feng et al. [19] proposed a YOLOv5-based target detection model for the automatic detection of insulator defects; comparing four versions of the YOLOv5 model, the YOLOv5x model with k-means clustering could effectively identify and locate insulator defects in transmission lines, but its maximum accuracy was only 86.8%. Liu et al. [20] proposed the MTI-YOLO network model, which uses a multi-scale detection head and a multi-scale feature fusion structure to improve detection accuracy, but it only detects defects of common insulators. Wu et al. [21] proposed a CenterNet-based insulator defect detection method that simplified the backbone network and applied an attention mechanism to suppress useless information, improving detection accuracy; however, the detection speed is low, and when two different defect classes share the same centroid, CenterNet can detect only one of them. Qiu et al. [22] proposed an improved YOLOv4-based algorithm for insulator defect detection, in which a GraphCut image enhancement method and image sharpening were used to rebuild the dataset, and a MobileNet lightweight network was fused with the YOLOv4 model structure. Tao et al. [23] proposed a new cascaded convolutional neural network architecture for insulator defect localization and detection, which uses a CNN based on a region proposal network to transform defect detection into a two-level object detection problem. Wang et al. [24] proposed an insulator defect detection method based on an improved ResNeSt and a region proposal network (RPN); a new network based on ResNeSt was first built, and the improved RPN was then added for feature extraction to better detect minor defects on insulators.
Insulators are diverse and complex, and traditional manual inspection is inefficient and prone to missed and false detections. It is therefore especially important to research methods for insulator defect detection. This article focuses on the problem of insulator defects in power systems, applying deep learning-based target detection to insulator defect detection, which is significant for improving the inspection intelligence of power systems.
The YOLO series is a family of deep learning neural network image recognition algorithms. The YOLOv5 algorithm, a representative single-stage detection algorithm, has the advantages of a small code base, a simple pipeline, fast detection speed, and high detection accuracy, making it one of the image recognition algorithms closest to engineering practice. In this paper, the YOLOv5 algorithm is improved for insulator image recognition and defective insulator detection. First, the insulator defect image samples are re-clustered using the k-means clustering algorithm to obtain prior anchor box parameters of different sizes. Second, the normalization-based attention module (NAM) is added to the feature extraction part of the YOLOv5 algorithm; NAM redesigns the channel and spatial attention submodules and uses a contribution factor on the weights to improve the attention mechanism, employing batch normalization scale factors, with the standard deviation indicating weight importance. Third, gnConv recursive gated convolution replaces the standard convolution modules of the neck network in the YOLOv5 model; gnConv models higher-order spatial interactions and offers high performance, scalability, and translation invariance. These improvements enhance the network's feature extraction capability and feature fusion efficiency, improving YOLOv5s detection performance. The proposed method achieves a good balance between accuracy and speed, and its performance fully meets the demands of power inspection for online insulator localization.
2. Related Work
The YOLO algorithm is a target detection algorithm proposed by Joseph Redmon [14], and YOLOv5 builds on the original YOLO detection framework with the most effective optimization strategies of recent years, yielding large improvements in both speed and accuracy. The YOLOv5 algorithm has four versions: YOLOv5s, YOLOv5m, YOLOv5l, and YOLOv5x, which differ in the depth and width settings of the model. A deeper backbone network produces more feature maps, and deeper networks imply more complexity. Among them, YOLOv5s is the network with the smallest depth and the smallest feature map width; however, a larger model does not necessarily yield better detection accuracy, and the choice must suit the practical application. The YOLO model can generally be divided into four modules: the input, the benchmark backbone network, the neck network, and the head.
As shown in Figure 1, the network structure of YOLOv5 is as follows. The input stage includes mosaic augmentation, adaptive anchor box calculation, and adaptive image scaling, and finally converts the image into a 640 × 640 × 3 tensor fed into the network. The backbone mainly extracts features from the input image: first a slice operation through the conv + BN + SiLU (CBS) layer; then downsampling for feature extraction through the convolution modules; then the C3 (BottleneckCSP) and Conv operations, which produce the feature maps; and finally the spatial pyramid pooling (SPPF) module, which improves accuracy. The neck network usually sits between the backbone network and the head network; it makes better use of the features extracted by the backbone, achieving multi-scale prediction through the feature pyramid network (FPN) and path aggregation network (PAN) structures with their upsampling and downsampling processes [25]. The head output layer mainly uses the previously extracted features to make predictions and produce the target detection results. This article describes an improvement of the YOLOv5s network that increases the accuracy of insulator defect detection.
3. Methodology
3.1. K-Means Algorithm for Re-Clustering Anchor Frames
In general, in anchor-based target detection algorithms, most anchors are designed by hand; for example, the classical SSD and Faster R-CNN models each use nine hand-designed anchors of different sizes and aspect ratios. The disadvantage is that manually designed anchors are not guaranteed to suit different datasets: if the designed anchor sizes differ substantially from the target sizes in the dataset, the model's detection performance suffers. For YOLOv2, Joseph Redmon proposed replacing manual design with k-means clustering, which clusters the bounding boxes of the training set and automatically generates a set of anchors better suited to the dataset, improving detection. The default anchor boxes of the YOLOv5 algorithm are preset for the COCO dataset, whose image sizes and detection targets do not match the insulator defect dataset used here, so we use the k-means clustering method to recalculate anchor boxes that match the labeled boxes of this dataset. The k-means clustering method works mainly by computing the distance (similarity) between samples and grouping nearby samples into the same class (cluster) [26]. The primary steps of the k-means algorithm are as follows:
- Step 1: Initialize K cluster centers (assume K = 2).
- Step 2: Randomly select K samples among all samples as the initial cluster centers, as shown in Figure 2a, where the two black solid dots represent the two randomly initialized cluster centers.
- Step 3: Calculate the (Euclidean) distance of each sample from each cluster center, and assign each sample to the nearest cluster. Different colors distinguish different clusters, as in Figure 2b.
- Step 4: Update the cluster centers: calculate the mean of all samples in each cluster as the new cluster center. As shown in Figure 2c, the two blue solid points have moved to the centers of the corresponding clusters.
- Step 5: Repeat Steps 3 and 4 until the cluster centers stop changing, or change little enough to satisfy the given termination condition. The final clustering result is shown in Figure 2d.
Figure 2.
K-means clustering. (a) Randomly initialize two cluster centers; (b) Calculate the Euclidean distance between the sample and the cluster center. (c) Update the cluster centers. (d) Final clustering result.
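The five steps above can be sketched in a few lines of Python. This is a minimal NumPy illustration (the function name `kmeans` is ours, not the paper's implementation); an empty cluster simply keeps its previous center.

```python
import numpy as np

def kmeans(samples, k, max_iter=100, seed=0):
    """Minimal k-means: Euclidean distance, mean-based center updates."""
    rng = np.random.default_rng(seed)
    # Step 2: randomly pick k samples as the initial cluster centers.
    centers = samples[rng.choice(len(samples), k, replace=False)].astype(float)
    for _ in range(max_iter):
        # Step 3: assign each sample to its nearest center.
        dists = np.linalg.norm(samples[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 4: recompute each center as the mean of its cluster
        # (an empty cluster keeps its old center).
        new_centers = centers.copy()
        for i in range(k):
            members = samples[labels == i]
            if len(members):
                new_centers[i] = members.mean(axis=0)
        # Step 5: stop once the centers no longer move.
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers, labels
```

For two well-separated groups of points, the two returned centers converge to the group means regardless of which samples are drawn as initial centers.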
For re-clustering the anchor boxes with the k-means algorithm, a bounding box is usually represented by its top-left and bottom-right vertices, i.e., (x1, y1, x2, y2). When clustering the boxes, we only need their widths and heights as features, first normalized by the image width and height, i.e., w = w_box/W_img and h = h_box/H_img. If we directly use the Euclidean distance as the metric in the standard k-means algorithm, large boxes will generate more error than small boxes in the clustering results. Since we only care about the intersection over union (IOU) between an anchor and a box, and not about box size, the IOU is a more appropriate metric, as shown in Figure 3.
Suppose we have box = (w_b, h_b) and anchor = (w_a, h_a); then we have (1):

IOU(box, anchor) = [min(w_b, w_a) × min(h_b, h_a)] / [w_b h_b + w_a h_a − min(w_b, w_a) × min(h_b, h_a)]
(1)

We do not care about the box's position when calculating the IOU here; we assume all boxes' top-left vertices are at the origin. Obviously, the value of the IOU lies between 0 and 1, and the more similar two boxes are, the larger their IOU value. Since more similar boxes should be "closer", the final metric is shown in (2):

d(box, anchor) = 1 − IOU(box, anchor)
(2)
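Because all boxes are assumed to share the same top-left origin, the IOU metric in (1) and the distance in (2) depend only on widths and heights. They can be computed as follows (illustrative helper functions with names of our choosing, not the paper's code):

```python
def iou_wh(box, anchor):
    """IOU of two origin-aligned boxes, each given as (width, height)."""
    w1, h1 = box
    w2, h2 = anchor
    inter = min(w1, w2) * min(h1, h2)   # overlap area of origin-aligned boxes
    union = w1 * h1 + w2 * h2 - inter   # total covered area
    return inter / union

def kmeans_distance(box, anchor):
    """Distance metric of (2): similar boxes (high IOU) are 'close'."""
    return 1.0 - iou_wh(box, anchor)
```

For example, a 2 × 2 box against a 1 × 1 anchor gives an intersection of 1 and a union of 4, so IOU = 0.25 and distance = 0.75.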
The steps to perform k-means on the boxes are as follows:
- Step 1: Random initialization: select K boxes as the initial anchors.
- Step 2: Using the IOU metric, assign each box to the anchor closest to it.
- Step 3: Calculate the average width and height of all boxes in each cluster and update the anchors.
- Step 4: Repeat Steps 2 and 3 until the anchors no longer change, or the maximum number of iterations is reached.
A new set of anchor boxes was obtained by clustering the labeled boxes of the dataset used in this paper. The new anchor boxes match the sizes of the insulator defects in the dataset more closely, making the actual detection results more consistent with the task requirements.
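Putting the pieces together, the anchor re-clustering of Steps 1–4 can be sketched as below. This is a NumPy illustration under our own helper names (`iou_wh`, `anchor_kmeans`): anchors are updated as mean widths and heights, and an empty cluster keeps its previous anchor.

```python
import numpy as np

def iou_wh(boxes, anchors):
    """Pairwise IOU between (N, 2) box sizes and (K, 2) anchor sizes,
    with all boxes aligned at the origin (only width/height matter)."""
    inter = np.minimum(boxes[:, None, 0], anchors[None, :, 0]) * \
            np.minimum(boxes[:, None, 1], anchors[None, :, 1])
    union = (boxes[:, 0] * boxes[:, 1])[:, None] + \
            (anchors[:, 0] * anchors[:, 1])[None, :] - inter
    return inter / union

def anchor_kmeans(boxes, k, max_iter=100, seed=0):
    """k-means over (w, h) pairs using d = 1 - IOU as the distance metric."""
    rng = np.random.default_rng(seed)
    # Step 1: randomly pick k labeled boxes as the initial anchors.
    anchors = boxes[rng.choice(len(boxes), k, replace=False)].astype(float)
    for _ in range(max_iter):
        # Step 2: assign each box to the anchor with the smallest 1 - IOU.
        labels = (1.0 - iou_wh(boxes, anchors)).argmin(axis=1)
        # Step 3: update each anchor to the mean width/height of its cluster.
        new_anchors = anchors.copy()
        for i in range(k):
            members = boxes[labels == i]
            if len(members):
                new_anchors[i] = members.mean(axis=0)
        # Step 4: stop when the anchors no longer change.
        if np.allclose(new_anchors, anchors):
            break
        anchors = new_anchors
    return anchors
```

Run on a set of labeled boxes split between small and large targets, the returned anchors settle on the mean size of each group.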
3.2. Normalization-Based Attention Mechanism (NAM)
The NAM serves as an efficient and lightweight attention mechanism [27]. Compared with other attention mechanisms, it requires no additional computation or parameters such as fully connected or convolutional layers. It adopts the modular integration of CBAM and redesigns the channel and spatial attention submodules.
For the channel attention submodule, the scale factor of batch normalization (BN) is used to measure the variance of each channel and indicate its importance, as shown in (3):

B_out = BN(B_in) = γ × (B_in − μ_B) / sqrt(σ_B² + ε) + β
(3)

where μ_B and σ_B² are the mean and variance, respectively, of the mini-batch, and γ and β are the trainable affine transformation parameters. The channel attention submodule is shown in Figure 4 and (4):

M_c = sigmoid(W_γ (BN(F_1)))
(4)

where M_c represents the output features, γ_i is the scaling factor of each channel, F_1 is the input feature, and the weights are W_γ = γ_i / Σ_j γ_j, so we can obtain the weight of each channel.
For the spatial attention submodule, the scale factor of BN is applied to the spatial dimension to measure the importance of each pixel, which is called pixel normalization. The corresponding spatial attention submodule is shown in Figure 5 and (5):

M_s = sigmoid(W_λ (BN_s(F_2)))
(5)

where M_s denotes the output features, λ_i is the scaling factor, F_2 is the input feature, and the weights are W_λ = λ_i / Σ_j λ_j. To suppress the less significant weights, we add a regularization term to the loss function, as shown in (6), where x denotes the input, y the output, W the network weights, l(·) the loss function, g(·) the penalty function, and p the factor balancing the penalties on γ and λ:

Loss = Σ_(x,y) l(f(x, W), y) + p Σ g(γ) + p Σ g(λ)
(6)
The effectiveness of the NAM attention mechanism varies with its insertion position, and the depth of the module affects where the attention should be inserted. In this paper, the NAM module is integrated into the YOLOv5s model; through repeated network training, the module was finally inserted into the neck at the upsampling position. The core of this stage is to sample the obtained feature maps with multiple pooling kernels of different sizes for feature extraction. We add the NAM attention mechanism after it, which reduces the weight of less essential features. This approach applies a sparsity penalty to the attention module weights, making their computation more efficient. In this way, the channel and spatial dimensions can be better integrated to recover the channel features and spatial location information of insulator defect images, which not only extracts the essential location information of insulator defects but also identifies the defect class more quickly and accurately, helping to improve detection accuracy.
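The channel-weighting idea of (3) and (4) can be sketched as follows. This is a minimal NumPy illustration, not the authors' implementation: `gamma` stands for the learned BN scale factors, the BN statistics themselves are omitted, and `nam_channel_attention` is a hypothetical helper name.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def nam_channel_attention(x, gamma):
    """Sketch of NAM channel weighting on a (C, H, W) feature map.
    Channels with larger BN scale factors (gamma) vary more and are
    treated as more informative, so they receive larger weights."""
    weights = gamma / gamma.sum()            # w_i = gamma_i / sum_j gamma_j
    gate = sigmoid(x * weights[:, None, None])  # re-weight, then gate
    return x * gate                          # apply the attention map
```

As a design note, because the weights come directly from BN parameters that are trained anyway, this attention adds essentially no extra parameters, which is the "lightweight" property the text describes.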
3.3. Recursive Gated Convolutions (gnConv)
gnConv is an efficient operation for performing higher-order spatial interactions based on gated convolutions and a recursive design [28]. gnConv is built from standard convolutions, linear projections, and element-wise multiplication; the new operation is highly flexible and customizable, has input-adaptive spatial mixing similar to self-attention, and does not add a large amount of computation.
Figure 6 shows the structure of the gnConv recursive gated convolution.
3.3.1. Gated Convolution-Based Input Adaptive Interaction
We seek to perform spatial interactions more efficiently and effectively through some simple convolutional and fully connected layer operations.
The basic operation of this method is the gated convolution (gConv). Let x ∈ R^(HW×C) be the input feature; the output of the gated convolution can be expressed as (7) and (8):

[p_0, q_0] = φ_in(x), p_0, q_0 ∈ R^(HW×C)
(7)

p_1 = f(q_0) ⊙ p_0, y = φ_out(p_1)
(8)

The input features are first linearly projected to give p_0 and q_0; φ_in and φ_out are linear projection layers that perform channel mixing. f is a depthwise convolution, with f(q_0)^(i,c) = Σ_(j∈Ω_i) w^c_(i→j) q_0^(j,c), where Ω_i is the local window centered on position i and w represents the convolution weights of f. The above equations introduce interactions with the neighboring features q_0^(j,c) through multiplication between elements. After the depthwise convolution, the element-wise product with p_0 is taken to obtain p_1, which after another linear projection yields the output y; at this point, the first-order spatial interaction has been extracted.
3.3.2. Higher-Order Interactions for Recursive Gated Convolution
After achieving effective first-order spatial interaction with gConv, a recursive gated convolution, gnConv, is designed to further enhance the capacity of the model by introducing higher-order interactions.
We start by applying a higher-order linear projection φ_in to obtain the features p_0 and q_0, …, q_(n−1), as shown in (9):

[p_0, q_0, q_1, …, q_(n−1)] = φ_in(x), p_0 ∈ R^(HW×C_0), q_k ∈ R^(HW×C_k)
(9)

Then, executing the gated convolution recursively, we can sequentially obtain p_(k+1), as shown in (10):

p_(k+1) = f_k(q_k) ⊙ g_k(p_k) / α, k = 0, 1, …, n − 1
(10)

Here the output is scaled by 1/α for stable training, {f_k} is a set of depthwise convolutional layers, and {g_k} is used for dimensional alignment across the different orders, as shown in (11):

g_k = Identity, k = 0; g_k = Linear(C_(k−1), C_k), 1 ≤ k ≤ n − 1
(11)

As shown in (10), the interaction order of p_k increases by 1 with each recursive step, and the output of the last recursive step, p_n, is input to the projection layer φ_out to obtain the result of gnConv, so gnConv can perform spatial interactions of order n. To ensure that higher-order interactions do not introduce a tremendous computational cost, the channel dimension of each order is set as in (12):

C_k = C / 2^(n−k−1), 0 ≤ k ≤ n − 1
(12)

Equation (12) indicates that higher-order spatial interactions are performed in a coarse-to-fine manner, where lower orders are computed with fewer channels.
gnConv is not designed merely to mimic self-attention. It has the following three advantages: (1) Simplicity and efficiency: the convolution-based implementation avoids the quadratic complexity of self-attention, and progressively increasing the channel width while performing spatial interactions allows the model to achieve higher-order interactions with limited complexity. (2) Scalability: the second-order interaction of self-attention is extended to arbitrary orders, further improving the detection capability of the model. (3) gnConv fully inherits the translation equivariance of standard convolution, introducing a beneficial inductive bias for the defect detection task and avoiding the asymmetry caused by local attention.
Because the neck part of the YOLOv5 model performs further deep feature extraction, we applied the gnConv recursive gated convolution to the neck, replacing the original standard convolutions, and obtained the best results through experiments, reflecting the efficiency of accurate insulator defect recognition.
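The recursive structure of (9)–(12) can be sketched loosely in NumPy. This is an illustrative toy version only, not the HorNet implementation: the projection matrices are random and untrained, a 3 × 3 box filter stands in for the learned depthwise convolutions f_k, the stabilizing 1/α scaling is omitted, the dimension-alignment projection is applied after the element-wise product for brevity, and the names `gnconv` and `dwconv3x3` are ours.

```python
import numpy as np

def dwconv3x3(x):
    """Depthwise 3x3 box filter: a stand-in for the learned depthwise conv f_k."""
    pad = np.pad(x, ((0, 0), (1, 1), (1, 1)), mode="edge")
    out = np.zeros_like(x)
    for dy in range(3):
        for dx in range(3):
            out += pad[:, dy:dy + x.shape[1], dx:dx + x.shape[2]]
    return out / 9.0

def gnconv(x, n=3, seed=0):
    """Simplified n-order recursive gated convolution on a (C, H, W) tensor."""
    rng = np.random.default_rng(seed)
    C, H, W = x.shape
    # Channel widths per order, coarse to fine: C_k = C / 2^(n-k-1), eq. (12).
    dims = [C // 2 ** (n - 1 - k) for k in range(n)]
    # phi_in: one projection producing p_0 and all gating branches q_0..q_{n-1}.
    proj_in = rng.standard_normal((dims[0] + sum(dims), C)) * 0.1
    feats = np.einsum("oc,chw->ohw", proj_in, x)
    p = feats[:dims[0]]                      # p_0
    qs, start = [], dims[0]
    for d in dims:
        qs.append(feats[start:start + d])    # q_0 ... q_{n-1}
        start += d
    for k in range(n):
        # Recursive gating: gate p with the spatially mixed branch q_k.
        p = dwconv3x3(qs[k]) * p
        if k + 1 < n:
            # Align channels to the next (wider) order C_k -> C_{k+1}.
            g = rng.standard_normal((dims[k + 1], dims[k])) * 0.1
            p = np.einsum("oc,chw->ohw", g, p)
    # phi_out: project the highest-order result back to C channels.
    proj_out = rng.standard_normal((C, dims[-1])) * 0.1
    return np.einsum("oc,chw->ohw", proj_out, p)
```

Note how each recursion multiplies in one more spatially mixed branch, so the output is an n-th-order product of input features, while the halved channel widths of the lower orders keep the extra cost small.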
Figure 7 shows the network structure of the improved optimal model.