**1. Introduction**

The main function of a power line vibration damper is to reduce the vibration of the wire caused by wind galloping. The spans between high-voltage transmission towers are long, which makes it easy for the wires to vibrate when subjected to wind. The periodic bending of the suspension caused by the vibration of the wire leads to fatigue damage to the metal wire; in severe cases, accidents such as wire breakage and power tower collapse can be induced. The use of a vibration damper on high-voltage transmission lines can reduce the wind-caused vibration of the wires, thereby reducing the probability of accidents. Therefore, vibration damper detection is an important topic in the inspection of overhead transmission lines [1]. Vibration damper detection refers to obtaining the specific position of the vibration damper in the inspection image. This task is an important prerequisite for vibration damper displacement detection, damage detection, and corrosion detection. At present, vibration damper detection has attracted the attention of researchers in the fields of smart grid and machine vision, with certain progress made [2].

UAV technology has developed rapidly in recent years. UAVs have the advantages of convenient operation, easy portability, and low cost [3]. Multi-UAV systems based on wireless sensor networks [4] are used in crop yield estimation [5], object detection [6], and other fields. UAVs have rapidly developed into important auxiliary equipment.

**Citation:** Chen, W.; Li, Y.; Zhao, Z. Transmission Line Vibration Damper Detection Using Deep Neural Networks Based on UAV Remote Sensing Image. *Sensors* **2022**, *22*, 1892. https://doi.org/10.3390/s22051892

Academic Editors: Moulay A. Akhloufi and Mozhdeh Shahbazi

Received: 30 January 2022 Accepted: 26 February 2022 Published: 28 February 2022

At present, the inspection of overhead transmission lines still mainly relies on visual inspection by staff, which can produce omissions and incorrect judgments for vibration dampers located high above the ground; therefore, the use of UAVs for transmission line inspection is an issue of great research value. Researchers have used UAVs for equipment detection and other tasks [7,8]. This article focuses on the issue of vibration damper detection using UAV aerial images.

In early research work, traditional image processing techniques were most widely used in power line inspection scenarios [2,9]. Researchers would select appropriate feature extraction operators according to the actual situation and complete the task of object detection through a threshold setting. Machine learning algorithms were also selected to achieve better detection results [10]. However, such methods are very susceptible to interference from background information, especially when using UAV aerial photography data, as the similar color properties of vibration dampers and power towers can easily cause missed detection.

In recent years, with the exponential growth of machine computing power and data volume, deep learning has again become a research hotspot. Deep learning technologies, especially convolutional neural networks, have opened new research directions in the field of computer vision. There is much research applying state-of-the-art object detection methods to power components [11,12]. However, at present, these works are mainly straightforward applications of existing frameworks: there is no targeted improvement for the characteristics of the vibration damper, and achieving high model accuracy requires a large amount of computing resources.

In addition, some studies have used special equipment for imaging or for the physical properties of the device [13,14]. The results of these works are usually excellent, but the extra equipment overhead and high usage cost make such methods unsuitable for power line patrol scenarios.

Aiming at the research status of image-based vibration damper detection, this article proposes a vibration damper detection model based on a one-stage object detection algorithm. The main contributions of this paper are as follows:


The remainder of this article is organized as follows. Section 2 briefly introduces the related work of vibration damper detection. Section 3 introduces the basic framework used in the method proposed in this article. In Section 4, this article introduces the details of DamperYOLO. In Section 5, this article introduces the damper dataset, the experimental details, and a series of comparative experiments. Section 6 provides a brief summary of the work.

#### **2. Related Work**

This section focuses on image-based vibration damper detection research. The existing work is mainly divided into traditional image processing methods, deep learning-based research, and detection methods based on auxiliary equipment.

#### *2.1. Traditional Method*

Traditional image processing algorithms use edge detection, color space conversion, and clustering algorithms to extract damper information in images, usually combined with machine learning algorithms for iterative classification tasks.

Wu et al. [2] used the snake model to extract the edge of the vibration damper, but the required helicopter-borne imaging equipment made the cruise cost high. Huang et al. [9] performed corrosion and displacement detection on the vibration damper based on the rusty area ratio and a color shade index, involving grayscale processing, edge detection, threshold segmentation, morphological processing, and other technologies. Similarly, Song et al. [15] detected rust on the vibration damper based on the histogram. Jin et al. [10] used Haar-like features and a cascade AdaBoost classifier to classify and detect vibration dampers on overhead lines. Yang et al. [16] performed an exponential transformation on the S and V components in the HSV color space to improve the contrast between the foreground and background. Liu et al. [17] used the Canny operator and the Hough transform to detect the displacement of the vibration damper on the high-voltage line. Similarly, Chen et al. [18] used the randomized Hough transform for the vibration damper detection task. Miao et al. [19] used the wavelet modulus maxima method to locate the shock hammer on the transmission line. Pan et al. [20] used a simple extraction operator to monitor the state of the vibration damper. Jin et al. [21] used the AdaBoost algorithm to conduct real-time monitoring of the line vibration damper using drones.

Traditional methods use operators and classifiers to identify the vibration damper on the line. Their detection accuracy is limited by the complexity of the environmental background, but their advantage lies in fast detection speed, which is suitable for real-time detection.

#### *2.2. Deep Neural Networks*

With the rapid development of deep learning technology, the detection of power line components based on neural networks such as CNNs has gradually become a popular research direction.

Based on YOLOv4, Bao et al. [1] used k-means to analyze the aspect ratio of the anchors to detect damage, corrosion, and displacement faults of the vibration damper. Zhang et al. [11] used Faster R-CNN to detect damage and corrosion defects of the vibration damper in two passes, in which the first detection result was used as the proposal for the second, thereby improving the detection effect. Bao et al. [12] used the Cascade R-CNN framework to locate the vibration damper and detect its damage. Yang et al. [16] performed the detection task of vibration dampers using Faster R-CNN on HSV color-space-transformed images. Guo et al. [22] used YOLOv4 to improve the detection of damaged vibration dampers. Wang et al. [23] investigated insulator defects in overhead transmission lines, damage to vibration dampers, and foreign objects such as bird's nests. Zhang et al. [24] switched to VGG16 as the backbone network and performed detection tasks for shockproof hammers and other foreign objects on power towers.

The detection of power line components and foreign objects using deep neural networks has also attracted the attention of researchers. For example, the YOLO framework has been used to detect insulators on transmission lines [25] and for icing detection [26]; the anchor settings of Faster R-CNN have been changed according to the shape characteristics of the insulator [27]; Mask R-CNN has been used to detect foreign objects on lines [28]; and defect detection has been performed for high-speed rail catenary insulators [29] and for wet insulators using infrared images [30]. Usually, these studies are only straightforward applications to power component datasets; most lack targeted adaptation for specific environments and scenarios, and the solutions provided are mostly stacked tricks.

#### *2.3. Auxiliary Equipment*

In addition to using common optical images, there is also research that uses other imaging equipment and auxiliary devices to perform detection tasks. For example, a robot is used to reset the vibration damper [14,17,31], and the damage of the vibration damper is detected based on LiDAR data [13]. The damping of the vibration damper is detected based on sensors such as optical ground wire (OPGW) and an all-dielectric self-supporting (ADSS) optical cable [32]. In addition, some researchers [33] designed a rotation-free spacer damper to improve the anti-galloping ability of power lines.

#### *2.4. Research Summary*

There is still room for improvement in the detection of vibration dampers for overhead transmission lines. A summary of this research is as follows:


Combining the characteristics of the abovementioned research work, we hope not only to obtain excellent detection results, but also for the model to run in real time on devices lacking computing resources, such as drones. A one-stage method using a deep neural network is the most suitable choice. One-stage object detection utilizes the powerful feature extraction capabilities of CNNs to cope with complex application scenarios. At the same time, the detection result does not depend on proposals, and its calculation speed is fast enough. Therefore, in the following work, based on the one-stage model, we propose a detection method based on the visual characteristics of the vibration damper in real scenes.

#### **3. Basic Knowledge of YOLO**

YOLO [34], proposed by Redmon et al. in 2016, is a classic one-stage object detection method. The YOLO family [34–37] solves object detection as a regression problem: after one inference pass over the input image, the positions of all objects in the image, their categories, and the corresponding confidence probabilities are obtained. YOLO divides the input image into S × S grids, and each grid is responsible for detecting objects that fall into it: if the center coordinates of an object fall into a grid, that grid is responsible for detecting the object.
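As a minimal illustration of this grid-assignment rule (not code from the paper; `responsible_cell` is a hypothetical helper name), the cell responsible for an object can be computed directly from its normalized center coordinates:

```python
# Sketch: map an object's normalized center coordinates to the S x S grid
# cell responsible for detecting it, following YOLO's assignment rule.
def responsible_cell(cx, cy, S):
    """Return (row, col) of the grid cell containing center (cx, cy).

    cx, cy are normalized to [0, 1]; S is the grid size.
    """
    col = min(int(cx * S), S - 1)  # clamp in case cx == 1.0 exactly
    row = min(int(cy * S), S - 1)
    return row, col

# An object centered at (0.52, 0.31) on a 7 x 7 grid falls in cell (2, 3).
print(responsible_cell(0.52, 0.31, 7))  # -> (2, 3)
```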

The backbone of YOLOv4 [37] differs in that it builds on the Darknet structure of YOLOv3 [36] and borrows the structure of the CSPNet network [38], yielding a network structure called CSPDarknet. The loss function used in training is CIoU [39].

Since the objects to be detected in this paper are only vibration dampers, an overly complex network structure would have a negative impact on feature extraction; therefore, this paper selects the classic ResNet101 [40] as the feature extraction network. The objective function of YOLOv4 is as follows:

$$L\_{det} = L\_{box} + L\_{obj} + L\_{cls} \tag{1}$$

where *Lbox*, *Lobj*, and *Lcls* represent the regression loss, confidence loss, and category loss of the box, respectively. The expression of the box regression loss is as follows:

$$L\_{box} = \lambda\_{coord} \sum\_{i=0}^{S^2} \sum\_{j=0}^{B} \mathbf{1}\_{i,j}^{obj} \left( 1 - IoU + \frac{Distance\\_2^2}{Distance\\_C^2} + \frac{v^2}{(1 - IoU) + v} \right) \tag{2}$$

where *λ<sub>coord</sub>* is the weight of the box regression loss, *i* indexes the *S* × *S* = *S*<sup>2</sup> grid cells, *j* indexes the *B* predicted boxes of each cell, and **1**<sup>obj</sup><sub>i,j</sub> indicates that box *j* of cell *i* contains the center of a target of the predicted category. *IoU* is the Intersection-over-Union of the predicted box and the ground truth, calculated by Equation (3); *Distance*\_2 is the Euclidean distance between the center coordinates of *Box<sup>p</sup>* and *Box<sup>gt</sup>*; *Distance*\_*C* is the diagonal length of the smallest bounding rectangle of *Box<sup>p</sup>* and *Box<sup>gt</sup>*; and *v* is a parameter measuring the consistency of the aspect ratios of *Box<sup>p</sup>* and *Box<sup>gt</sup>*, calculated by Equation (4).

$$IoU = \frac{\left| Box^p \cap Box^{gt} \right|}{\left| Box^p \cup Box^{gt} \right|} \tag{3}$$

where *Box<sup>p</sup>* and *Boxgt* represent the predicted box and ground truth, respectively.

$$v = \frac{4}{\pi^2} \left( \arctan \frac{w^{gt}}{h^{gt}} - \arctan \frac{w^p}{h^p} \right)^2 \tag{4}$$

where *wgt* and *w<sup>p</sup>* represent the width of the ground truth and predicted box, respectively, while *hgt* and *h<sup>p</sup>* represent their respective heights.
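To make Equations (2)–(4) concrete, the following sketch computes the per-box CIoU regression term for a single predicted/ground-truth pair. This is an illustrative implementation, not the authors' code, and it assumes boxes given as `(x1, y1, x2, y2)` corner coordinates:

```python
import math

# Illustrative per-box CIoU regression term (Equations (2)-(4)) for one
# predicted box and one ground-truth box, each as (x1, y1, x2, y2).
def ciou_loss(box_p, box_gt):
    # Intersection-over-Union (Equation (3))
    ix1, iy1 = max(box_p[0], box_gt[0]), max(box_p[1], box_gt[1])
    ix2, iy2 = min(box_p[2], box_gt[2]), min(box_p[3], box_gt[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_p = (box_p[2] - box_p[0]) * (box_p[3] - box_p[1])
    area_gt = (box_gt[2] - box_gt[0]) * (box_gt[3] - box_gt[1])
    iou = inter / (area_p + area_gt - inter)

    # Squared center distance (Distance_2^2) and squared diagonal of the
    # smallest enclosing rectangle (Distance_C^2)
    cpx, cpy = (box_p[0] + box_p[2]) / 2, (box_p[1] + box_p[3]) / 2
    cgx, cgy = (box_gt[0] + box_gt[2]) / 2, (box_gt[1] + box_gt[3]) / 2
    d2 = (cpx - cgx) ** 2 + (cpy - cgy) ** 2
    ex1, ey1 = min(box_p[0], box_gt[0]), min(box_p[1], box_gt[1])
    ex2, ey2 = max(box_p[2], box_gt[2]), max(box_p[3], box_gt[3])
    c2 = (ex2 - ex1) ** 2 + (ey2 - ey1) ** 2

    # Aspect-ratio consistency term v (Equation (4))
    wp, hp = box_p[2] - box_p[0], box_p[3] - box_p[1]
    wgt, hgt = box_gt[2] - box_gt[0], box_gt[3] - box_gt[1]
    v = (4 / math.pi ** 2) * (math.atan(wgt / hgt) - math.atan(wp / hp)) ** 2

    # Note: for perfectly coincident boxes the last term is 0/0 as written
    # in Equation (2); real implementations guard that case.
    return 1 - iou + d2 / c2 + v ** 2 / ((1 - iou) + v)

# Two 2x2 boxes offset by 1: IoU = 1/3, center distance term = 1/13, v = 0.
print(ciou_loss((0, 0, 2, 2), (1, 0, 3, 2)))  # -> ~0.7436
```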

Similar to the regression loss, the loss function for the target prediction confidence is as follows:

$$L\_{obj} = \lambda\_{noobj} \sum\_{i=0}^{S^2} \sum\_{j=0}^{B} \mathbf{1}\_{i,j}^{noobj} (c\_i - \hat{c}\_i)^2 + \lambda\_{obj} \sum\_{i=0}^{S^2} \sum\_{j=0}^{B} \mathbf{1}\_{i,j}^{obj} (c\_i - \hat{c}\_i)^2 \tag{5}$$

where *λ<sub>noobj</sub>* and *λ<sub>obj</sub>* represent the weights of the confidence loss when the box does not and does contain an object, respectively; *c<sub>i</sub>* and *ĉ<sub>i</sub>* represent the true and predicted values of whether there is an object in the current box. The other parameters have the same meaning as in the regression loss.

The category prediction loss uses the classic cross-entropy loss, and its calculation formula is as follows:

$$L\_{cls} = -\lambda\_{class} \sum\_{i=0}^{S^2} \sum\_{j=0}^{B} \mathbf{1}\_{i,j}^{obj} \sum\_{c \in classes} \left[ p\_i(c) \log \hat{p}\_i(c) + (1 - p\_i(c)) \log (1 - \hat{p}\_i(c)) \right] \tag{6}$$

where *λ<sub>class</sub>* represents the weight of the category loss; *p̂<sub>i</sub>*(*c*) represents the predicted confidence of the current category; and *p<sub>i</sub>*(*c*) is a conditional probability, which takes the value 0 or 1 depending on whether the *i*th grid cell contains the target center, multiplied by *IoU*.

YOLOv4 uses CSPDarknet53 [38] as its feature extraction network, but CSPDarknet53 has a large number of parameters. In addition, the only object to be detected in this paper is the damper. As shown in Table 1, ResNet101 is composed of multiple groups of residual blocks. ResNet has excellent feature extraction ability and overcomes the low learning efficiency caused by excessive network depth. Therefore, the classic ResNet101 is used as the backbone in this article.
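The key property of the residual blocks mentioned above is that each block adds its input back to the learned transform, so a very deep stack can fall back to a near-identity mapping instead of degrading. A toy, framework-free sketch (illustrative names, 1-D feature vectors in place of real tensors):

```python
# Minimal illustration of the residual connection that gives ResNet its
# training stability: the block learns a residual F(x) and adds the input
# back, so an "identity" layer is easy to represent (F(x) = 0).
def relu(v):
    return max(0.0, v)

def residual_block(x, transform):
    # out = ReLU(F(x) + x), applied element-wise on a 1-D feature vector.
    return [relu(f + xi) for f, xi in zip(transform(x), x)]

# A transform that outputs all zeros leaves the (non-negative) input unchanged.
identity_like = residual_block([1.0, 2.0, 3.0], lambda x: [0.0] * len(x))
print(identity_like)  # -> [1.0, 2.0, 3.0]
```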


**Table 1.** Applied kernels of ResNet101 in DamperYOLO.

#### **4. DamperYOLO**

In this section, a new framework named DamperYOLO is proposed for the vibration damper detection task of overhead transmission lines based on YOLOv4 [37], Canny algorithm [41], attention mechanism [42] and FPN [43] structure.

#### *4.1. Edge Extraction*

The quality of the input image is very important: as the first step of the whole detection pipeline, it directly affects the subsequent detection process. Although strong noise immunity is one of the advantages of deep neural networks, any network benefits from a high-quality input, so that the trained model parameters attend more strongly to our target. Therefore, we decided to use edge detection techniques to enhance the semantic information in images, as detailed in this subsection.

The Canny algorithm is used to extract edge information from UAV aerial images. It is mainly divided into four parts: Gaussian smoothing of the image, calculation of gradient magnitude and direction, non-maximum suppression of the gradient magnitude, and double-threshold detection with edge connection.

Our images are obtained by UAV aerial photography and are highly susceptible to light reflections, which generate overexposed points. To reduce the influence of these bright white points, a Gaussian kernel is used to smooth the image.

Compared with the median filter [44] and the mean filter [45], the Gaussian filter assigns different weights to the different neighbors of the current element, achieving denoising while preserving the gray-level distribution characteristics of the image. Gaussian filtering is usually implemented by convolving the image with a (2*k* + 1) × (2*k* + 1) kernel. The kernel generation equation is shown in Equation (7).

$$H\_{ij} = \frac{1}{2\pi\sigma^2} \exp\left(-\frac{(i - (k+1))^2 + (j - (k+1))^2}{2\sigma^2}\right);\ 1 \le i, j \le (2k+1)\tag{7}$$

where *k* represents an integer, (2*k* + 1) represents the size of the convolution kernel, and (*i*, *j*) represents the coordinates of one of the points.

The size of the convolution kernel is usually set to an odd number for the convenience of calculation. The larger the kernel, the stronger its suppression of local noise. In our experiments, kernels of sizes 3 × 3, 5 × 5, and 9 × 9 were compared. The experimental results show that the 5 × 5 kernel had the smallest adverse effect.
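Equation (7) can be sketched as follows; the normalization step at the end is standard practice for smoothing kernels, but is an assumption not stated explicitly in the text:

```python
import math

# Sketch of the Gaussian kernel generator in Equation (7); the kernel is
# normalized afterwards so its weights sum to 1 (standard for smoothing).
def gaussian_kernel(k, sigma):
    size = 2 * k + 1
    kernel = [[(1 / (2 * math.pi * sigma ** 2))
               * math.exp(-((i - (k + 1)) ** 2 + (j - (k + 1)) ** 2)
                          / (2 * sigma ** 2))
               for j in range(1, size + 1)]
              for i in range(1, size + 1)]
    total = sum(sum(row) for row in kernel)
    return [[w / total for w in row] for row in kernel]

kernel = gaussian_kernel(k=2, sigma=1.0)  # the 5 x 5 kernel used in the text
print(len(kernel), len(kernel[0]))  # -> 5 5
```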

After Gaussian smoothing, the background still contains overexposed points. These need not harm the model, since the network focuses on the ground-truth regions during training. What matters is whether the features of the vibration damper are enhanced, and edge detection is one of the important means of image enhancement. In the Canny algorithm, the parts of the image with high gradient variation correspond to a higher probability of edges. Therefore, our next step is to extract the gradient information of the image.

Gradients reflect the intensity of local pixel transitions: the greater the gradient, the greater the change in the corresponding region. A gradient has two components, direction and magnitude, and a complete gradient is usually represented by the gradients in the horizontal and vertical directions. The calculation formulas are shown in Equations (8) and (9).

$$\frac{\partial f}{\partial \mathbf{x}} \approx \frac{f(\mathbf{x} + \mathbf{1}, y) - f(\mathbf{x} - \mathbf{1}, y)}{2} \tag{8}$$

$$\frac{\partial f}{\partial y} \approx \frac{f(\mathbf{x}, y+1) - f(\mathbf{x}, y-1)}{2} \tag{9}$$

The direction θ and magnitude ‖∇f‖ of the gradient can be obtained from the gradients in the horizontal and vertical directions, as shown in Equations (10) and (11).

$$\theta = \tan^{-1}(\frac{\partial f}{\partial y} / \frac{\partial f}{\partial x}) \tag{10}$$

$$\|\nabla f\| = \sqrt{\left(\frac{\partial f}{\partial \mathbf{x}}\right)^2 + \left(\frac{\partial f}{\partial y}\right)^2} \tag{11}$$
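A minimal sketch of Equations (8)–(11) on a small grayscale patch (using `atan2` and `hypot` as numerically robust stand-ins for the arctangent ratio and the square root):

```python
import math

# Central-difference gradient of a grayscale patch f at interior pixel (x, y),
# following Equations (8)-(11); f is indexed as f[y][x].
def gradient(f, x, y):
    gx = (f[y][x + 1] - f[y][x - 1]) / 2  # Equation (8)
    gy = (f[y + 1][x] - f[y - 1][x]) / 2  # Equation (9)
    theta = math.atan2(gy, gx)            # Equation (10), robust arctangent
    magnitude = math.hypot(gx, gy)        # Equation (11)
    return magnitude, theta

# A patch whose intensity rises left to right: the gradient points along +x.
patch = [[0, 10, 20],
         [0, 10, 20],
         [0, 10, 20]]
mag, theta = gradient(patch, 1, 1)
print(mag, theta)  # -> 10.0 0.0
```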

Gradient images contain all grayscale variations. Therefore, the Canny algorithm uses the non-maximum suppression method [41] to remove the lower gradient variations in each region.

The non-maximum suppression algorithm operates on the eight neighbors of each pixel, retaining the largest grayscale changes along the horizontal, vertical, and diagonal directions while eliminating smaller changes, thereby thinning the broad gradient ridges to single-pixel-wide edges.
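A per-pixel sketch of this rule, assuming the gradient direction is quantized to the nearest of the four comparison directions (illustrative code, not the paper's implementation):

```python
import math

# Sketch of per-pixel non-maximum suppression: quantize the gradient
# direction to one of four directions, then keep the pixel only if its
# magnitude is the local maximum along the gradient direction.
def nms_keep(mag, theta, x, y):
    angle = (math.degrees(theta) + 180) % 180  # fold direction into [0, 180)
    if angle < 22.5 or angle >= 157.5:  # gradient along x -> compare left/right
        n1, n2 = mag[y][x - 1], mag[y][x + 1]
    elif angle < 67.5:                  # 45-degree diagonal
        n1, n2 = mag[y - 1][x + 1], mag[y + 1][x - 1]
    elif angle < 112.5:                 # gradient along y -> compare up/down
        n1, n2 = mag[y - 1][x], mag[y + 1][x]
    else:                               # 135-degree diagonal
        n1, n2 = mag[y - 1][x - 1], mag[y + 1][x + 1]
    return mag[y][x] >= n1 and mag[y][x] >= n2

# A ridge of magnitudes: the strong center pixel survives suppression.
mags = [[0, 0, 0],
        [1, 5, 1],
        [0, 0, 0]]
print(nms_keep(mags, 0.0, 1, 1))  # -> True
```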

Non-maximum suppression can only sharpen the edge information; it cannot guarantee that the remaining pixels are foreground. Therefore, the last step of the Canny algorithm uses a double-threshold algorithm to separate the foreground and background based on prior knowledge.

In the double-threshold algorithm, pixels above the strong-edge threshold represent edge information, and pixels below the weak-edge threshold represent background information. Pixels between the two thresholds are pending: if there is a strong edge in the eight-neighborhood of such a pixel, it is also classified as an edge pixel. Through comparison experiments with strong-edge thresholds of 200, 300, and 400, it was found that a strong-edge threshold of 300 works best, with the weak-edge threshold set to 0.5 times the strong one. The formula for classifying gradient-map pixels is shown in Equation (12).

$$f(i) = \begin{cases} \text{strong edge}, & i > 300 \\ \text{weak edge}, & 150 \le i \le 300 \\ \text{non-edge}, & i < 150 \end{cases} \tag{12}$$
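With the thresholds chosen above, Equation (12) reduces to a simple per-pixel rule (the promotion of weak edges via the eight-neighborhood check is noted in a comment but not implemented here; `classify_pixel` is an illustrative name):

```python
# The double-threshold rule of Equation (12), with the strong-edge threshold
# of 300 and the weak-edge threshold of 150 (0.5x) chosen in the text.
def classify_pixel(gradient, strong=300, weak_ratio=0.5):
    weak = strong * weak_ratio
    if gradient > strong:
        return "strong edge"
    if gradient >= weak:
        return "weak edge"  # kept only if an 8-neighbor is a strong edge
    return "non-edge"

print(classify_pixel(350))  # -> strong edge
print(classify_pixel(200))  # -> weak edge
print(classify_pixel(100))  # -> non-edge
```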

To verify the effect of edge detection, we compared the performance of several classical edge detection operators on vibration dampers. As shown in Figure 1, the edge extracted by the Canny operator is the clearest.

**Figure 1.** Test examples of edge detection algorithm.

#### *4.2. Attention Mechanism*

After obtaining the edge information in the image with the Canny algorithm, it can be exploited through an attention mechanism. The attention mechanism [42] originated in the field of NLP and has been introduced into computer vision in recent years. As shown in Figure 2, by introducing additional convolution operations, the attention mechanism can focus on the additional information being added.

**Figure 2.** Schematic diagram of the attention mechanism.

The attention mechanism takes the edge information obtained by the Canny algorithm and performs a convolution operation to obtain the attention weight matrix. The expression of the convolution operation is shown in Equation (13).

$$I\_A^i = \text{Softmax}(I^i W\_A^i + b\_A^i), \text{ for } i = 1, 2 \tag{13}$$

where *I<sup>i</sup>* represents the input image, *W<sup>i</sup><sub>A</sub>* and *b<sup>i</sup><sub>A</sub>* (*i* = 1, 2) represent the parameters of the convolution operation, and Softmax(·) represents the softmax function used for normalization.

We multiplied the resulting attention weight matrix with the corresponding input image to obtain the final output:

$$I\_A = (I\_A^1 \otimes I^1) \oplus (I\_A^2 \otimes I^2) \tag{14}$$

where *I<sub>A</sub>* represents the final output of the attention mechanism, *I*<sup>1</sup> and *I*<sup>2</sup> represent the input images, and the symbols ⊗ and ⊕ represent element-wise multiplication and addition of matrices, respectively.
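A toy sketch of Equations (13) and (14), flattening each input to a vector and using scalar weights in place of real convolution kernels (all names and values are illustrative):

```python
import math

# Toy sketch of Equations (13) and (14): a scalar stand-in for the
# convolution W * I + b, followed by Softmax, gives per-element attention
# weights; each input is scaled by its weights and the branches are summed.
def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def attention_fuse(i1, i2, w=(1.0, 1.0), b=(0.0, 0.0)):
    a1 = softmax([w[0] * v + b[0] for v in i1])  # Equation (13), branch 1
    a2 = softmax([w[1] * v + b[1] for v in i2])  # Equation (13), branch 2
    # Equation (14): element-wise multiply each branch by its weights, add.
    return [a * x + c * y for a, x, c, y in zip(a1, i1, a2, i2)]

fused = attention_fuse([1.0, 2.0, 3.0], [3.0, 2.0, 1.0])
print(len(fused))  # -> 3
```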

The attention mechanism is used in ResNet101 to feed the edge image output by the Canny algorithm into the network, enhancing the network's ability to focus on the ground-truth region during feature extraction. We used the attention mechanism in layers 1, 2, and 3 of ResNet because the network focuses on the low-level features of the input image in the early stages of feature extraction. At the fourth and fifth layers, the output is a feature map with highly abstract semantics; at that point, introducing the attention mechanism containing the edge map would interfere with the feature map. A follow-up sensitivity analysis on where the attention mechanism is introduced supports this point.

#### *4.3. Feature Fusion Network*

After introducing the edge detection and attention mechanisms, our framework improved to a certain extent. However, in the inspection data of overhead transmission lines captured by UAVs, the vibration damper is a small target. When ResNet101 performs feature extraction, the deep layers respond readily to semantic features and the shallow layers to image features. This leads to a problem: although the high-level layers can respond to semantic features, the small size of their feature maps means they contain little geometric information, which is not conducive to object detection. The problem is more pronounced for small objects: because the target is small, the vibration damper easily disappears from the feature map output by the fifth layer of ResNet.

The disappearance of the vibration damper feature leads to a decrease in detection accuracy.

It is natural to think that a feature map combining deep and shallow features can meet the needs of small target detection. FPN [43] is a network structure that adopts this idea. FPN uses the idea of the image pyramid to address the difficulty of detecting small objects. The traditional image pyramid method uses multiscale image inputs to construct multiscale features; its biggest problem is that the recognition time is *k* times that of a single image, where *k* is the number of scales.

To improve detection speed, methods such as Faster R-CNN [46] use a single-scale Feature Map, but this limits the detection capability of the model, especially for samples with extremely low coverage in the training set (such as very large and very small samples). Unlike Faster R-CNN, which only uses the top-level Feature Map, SSD [47] exploits the hierarchical structure of convolutional networks, starting from conv4\_3 of VGG [48], and obtains multiscale Feature Maps from different network layers. Although this method improves accuracy without increasing test time, it does not use the low-level Feature Maps, which are very helpful for detecting small objects. In response to these problems, FPN adopts the pyramid of Feature Maps in the style of SSD.

Different from SSD, FPN uses not only the deep Feature Maps of VGG but also the shallow ones. These Feature Maps are efficiently integrated through bottom-up, top-down, and lateral connections, improving accuracy without greatly increasing detection time. Therefore, as shown in Figure 3, this article follows these practices and introduces a structure composed of an FPN and a bottom-up path after the third, fourth, and fifth layers of ResNet101, so that the output feature maps at the three scales are richer in both semantics and line features.
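The top-down half of this fusion can be sketched on 1-D toy "feature maps": the coarse map is upsampled and added element-wise to the lateral feature (nearest-neighbor upsampling is an assumption; FPN typically also applies 1 × 1 and 3 × 3 convolutions, omitted here for brevity):

```python
# Schematic sketch of FPN-style top-down fusion on 1-D "feature maps":
# the coarser map is upsampled (nearest-neighbor) and added to the lateral
# feature from the shallower layer. Shapes and values are illustrative only.
def upsample2x(feat):
    return [v for v in feat for _ in range(2)]

def topdown_merge(deep, lateral):
    up = upsample2x(deep)
    return [u + l for u, l in zip(up, lateral)]

c5 = [4.0, 6.0]            # deep: semantically strong, low resolution
c4 = [1.0, 1.0, 2.0, 2.0]  # shallower: higher resolution, more geometry
p4 = topdown_merge(c5, c4)
print(p4)  # -> [5.0, 5.0, 8.0, 8.0]
```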

**Figure 3.** The Feature Fusion Network used for feature transfer containing two parts: the FPN and the Bottom-up module.

DamperYOLO was trained after all framework components were introduced. The training process is described in Algorithm 1. As shown in Figure 4, the Edge Detection module, the ResNet101 backbone, the Attention Mechanism, and the FPN and bottom-up framework are used to construct the entire vibration damper detection pipeline.

**Figure 4.** The detection of vibration dampers is divided into three parts: Edge Detection, Feature Extraction, and Feature Fusion. First, Edge Detection provides edge information; then, Feature Extraction and Feature Fusion obtain feature maps for the vibration dampers. Finally, the detection results are obtained from the classifier of YOLOv4.

#### **Algorithm 1: The Training Process of DamperYOLO.**

