2.1. Network Architecture
The YOLOv5 algorithm was developed by Glenn Jocher and his team in 2020. It effectively balances detection speed and accuracy, making it widely applicable in the field of object detection [24]. The YOLOv5 algorithm comprises three main components: the backbone, the neck, and the detection head [25]. Input images first undergo feature extraction in the backbone, which generates three effective feature layers that are then passed to the neck network. The neck module performs feature fusion on the incoming feature layers: its top-down part achieves feature fusion across different scales through up-sampling and fusion with coarser-grained feature maps, and its bottom-up part then mainly employs convolutional layers to fuse features across different scales. Finally, the feature maps from the top-down and bottom-up parts are combined to obtain the ultimate feature maps [26]. The detection head serves as the classifier and regressor of the YOLOv5 model, assessing the features from the neck network to detect target objects [27]. In this paper, the YOLOv5s algorithm is enhanced to produce EDF-YOLOv5, which offers increased accuracy in detecting defects on transmission lines, as shown in Figure 1. Firstly, in the YOLOv5s backbone network, we introduce the EN-SPPFCSPC module to replace the Spatial Pyramid Pooling-Fast (SPPF) module. EN-SPPFCSPC comprehensively utilizes feature information, preventing the significant loss of detailed information and thereby enhancing the algorithm's capability to detect small target defects in power transmission lines. Secondly, we replace the C3 modules in the algorithm's neck network with DCNv3C3. This modification effectively enhances the algorithm's ability to generalize across the various shapes of power transmission line defects. Lastly, we introduce the Focal-CIoU loss function in place of CIoU. This change enhances the gradient contributions of high-quality samples during training, effectively improving the algorithm's convergence speed and detection accuracy.
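To make the neck's cross-scale fusion concrete, the following is a minimal PyTorch sketch of a single top-down fusion step of the kind described above; the class name, channel handling, and use of nearest-neighbor up-sampling are illustrative assumptions, not details taken from EDF-YOLOv5.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopDownFusion(nn.Module):
    """One top-down neck step: up-sample a deep feature map, concatenate it
    with the backbone feature map at the matching scale, and fuse by conv."""
    def __init__(self, deep_ch: int, shallow_ch: int, out_ch: int):
        super().__init__()
        self.reduce = nn.Conv2d(deep_ch, shallow_ch, 1)           # align channels
        self.fuse = nn.Conv2d(2 * shallow_ch, out_ch, 3, padding=1)

    def forward(self, deep: torch.Tensor, shallow: torch.Tensor) -> torch.Tensor:
        up = F.interpolate(self.reduce(deep), scale_factor=2, mode="nearest")
        return self.fuse(torch.cat([up, shallow], dim=1))         # cross-scale fusion

# Example: fuse a 10x10 deep map into a 20x20 shallower map.
neck = TopDownFusion(deep_ch=512, shallow_ch=256, out_ch=256)
p4 = neck(torch.randn(1, 512, 10, 10), torch.randn(1, 256, 20, 20))
print(p4.shape)  # torch.Size([1, 256, 20, 20])
```

In a PANet-style bottom-up path, a strided convolution typically takes the place of the up-sampling step.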
2.2. EN-SPPFCSPC Module
In this paper, inspired by the SPPFCSPC structure [28], we introduce EN-SPPFCSPC, which has improved feature extraction capabilities and effectively prevents the loss of detailed information about defective targets. The advantages of the EN-SPPFCSPC module stem primarily from two key improvements. Firstly, we employ three 5 × 5 SoftPool [29] sub-regions as a replacement for the max-pooling in SPPFCSPC. This substitution effectively mitigates the substantial loss of fine-grained details. SoftPool, applied to the input feature layer, activates feature points using a softmax-weighted approach within the pooling region. It then computes the weighted sum of all activated points within the pooling area, ultimately yielding the activation output of the pooling neighborhood. The principle of SoftPool is depicted in Figure 2. Assuming a pooling kernel of size $k \times k$, we calculate a weight, $w_i$, for each pixel point, $i$, within the feature region, $R$. If the pixel value of point $i$ is $a_i$, the activation weight is determined using Equation (1):

$$w_i = \frac{e^{a_i}}{\sum_{j \in R} e^{a_j}} \tag{1}$$

Subsequently, the weights are assigned to the respective pixel points and subjected to weighted summation, yielding the SoftPool output, $\tilde{a}$, as shown in Equation (2):

$$\tilde{a} = \sum_{i \in R} w_i \cdot a_i \tag{2}$$
SoftPool comprehensively considers every feature point within the pooling neighborhood, reducing information loss while preserving the overall receptive field. Moreover, each input feature receives a gradient, contributing to enhanced training effectiveness.
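As a concrete illustration, here is a minimal PyTorch sketch of the SoftPool operation defined by Equations (1) and (2); the function name and the use of avg_pool2d to form the window sums are our own choices, not from the paper.

```python
import torch
import torch.nn.functional as F

def soft_pool2d(x: torch.Tensor, kernel_size: int = 5, stride: int = 1) -> torch.Tensor:
    """SoftPool (Eqs. (1)-(2)): softmax-weighted sum over each pooling region R.

    A minimal sketch; a production version would subtract the per-window
    maximum before exponentiating for numerical stability.
    """
    e_x = torch.exp(x)
    # avg_pool2d computes window means; the 1/k^2 factors cancel in the ratio,
    # leaving sum(e^a * a) / sum(e^a) over each pooling region.
    padding = kernel_size // 2  # keep spatial size, as in an SPPF-style block
    num = F.avg_pool2d(e_x * x, kernel_size, stride=stride, padding=padding)
    den = F.avg_pool2d(e_x, kernel_size, stride=stride, padding=padding)
    return num / den

# Example: pool a feature map while preserving its spatial size.
feat = torch.randn(1, 64, 20, 20)
print(soft_pool2d(feat, kernel_size=5).shape)  # torch.Size([1, 64, 20, 20])
```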
Furthermore, to effectively reduce the number of parameters, EN-SPPFCSPC introduces a multi-group mechanism, employing group convolution in place of conventional convolution. The input feature map is divided into four groups along the channel dimension, and convolution is then performed on each group separately. The EN-SPPFCSPC structure is illustrated in Figure 3. The input feature map undergoes feature extraction through two branches. One branch contains only a GCBR module, which is composed of grouped convolution, batch normalization, and the ReLU activation function; the GCBR module in this branch uses a 1 × 1 convolutional kernel for grouped convolution with a group number of 4, effectively reducing the number of parameters and facilitating inter-channel information interaction. The other branch consists of three serial SoftPool operations with pooling kernels of size 5 × 5, along with multiple GCBR modules. In this branch, the feature map sequentially undergoes the three SoftPool operations, after which the outputs of the three SoftPools and the output of the third GCBR are concatenated along the channel dimension. This approach effectively controls the parameter count and maximizes the utilization of fine-grained feature information in the feature map. The EN-SPPFCSPC module, with its ability to fully integrate feature information, significantly enhances the algorithm's capability to detect small defective targets.
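Since the exact wiring follows Figure 3, which cannot be reproduced here, the sketch below shows one plausible reading of the description, reusing the soft_pool2d function from the previous sketch: a GCBR helper (grouped conv + BN + ReLU), a shortcut branch with a single 1 × 1 GCBR, and a main branch whose third GCBR output is concatenated with the three serial SoftPool outputs. The hidden channel sizes and the fusion convolutions are our assumptions.

```python
import torch
import torch.nn as nn

class GCBR(nn.Module):
    """Grouped Conv + BatchNorm + ReLU, as described for EN-SPPFCSPC."""
    def __init__(self, c_in, c_out, k=1, groups=4):
        super().__init__()
        self.conv = nn.Conv2d(c_in, c_out, k, padding=k // 2, groups=groups, bias=False)
        self.bn = nn.BatchNorm2d(c_out)
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.act(self.bn(self.conv(x)))

class ENSPPFCSPC(nn.Module):
    """One plausible reading of the EN-SPPFCSPC description (channels assumed)."""
    def __init__(self, c_in, c_out):
        super().__init__()
        c_hidden = c_out // 2
        self.shortcut = GCBR(c_in, c_hidden, k=1)   # branch 1: single 1x1 GCBR
        self.pre = nn.Sequential(                   # branch 2: GCBRs before pooling
            GCBR(c_in, c_hidden, k=1),
            GCBR(c_hidden, c_hidden, k=3),
            GCBR(c_hidden, c_hidden, k=1),
        )
        self.post = GCBR(4 * c_hidden, c_hidden, k=1)   # fuse pooled features
        self.out = GCBR(2 * c_hidden, c_out, k=1)       # fuse both branches

    def forward(self, x):
        y1 = self.shortcut(x)
        y2 = self.pre(x)
        p1 = soft_pool2d(y2, 5, stride=1)           # three serial 5x5 SoftPools
        p2 = soft_pool2d(p1, 5, stride=1)
        p3 = soft_pool2d(p2, 5, stride=1)
        y2 = self.post(torch.cat([y2, p1, p2, p3], dim=1))
        return self.out(torch.cat([y1, y2], dim=1))

# Example (channel counts must be divisible by the group number, 4).
m = ENSPPFCSPC(256, 256)
print(m(torch.randn(1, 256, 20, 20)).shape)  # torch.Size([1, 256, 20, 20])
```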
2.3. DCNv3C3 Module
YOLOv5s employs fixed geometric structures for its convolution kernels, which inherently limits its ability to model geometric transformations. Therefore, in this paper, we combine the C3 module with Deformable Convolution v3 (DCNv3) [30] to propose the DCNv3C3 module, which exhibits enhanced adaptability to changes in the shape of the target object. DCNv3 adjusts the sampling positions of the convolution via position-offset maps, allowing the sampling points to better focus on the target objects, and simultaneously adjusts the weights of the sampling points through modulation maps. Compared to regular convolution, DCNv3 therefore exhibits superior generalization capabilities, enabling it to adaptively locate and activate target units. A visual comparison of the sampling effects of regular convolution and DCNv3 is shown in Figure 4.
Normal convolution employs a fixed sampling grid, $D$, to sample the input feature map $x$. The final output is obtained by weighting the pixel values at the sampling points with the convolution kernel's weights. For example, for a 3 × 3 convolution kernel with the center position $p_0$ as the reference, the relative position of each sampling point can be represented as $p_k \in D = \{(-1,-1), (-1,0), \ldots, (1,1)\}$. The mathematical expression for normal convolution is provided in Equation (3):

$$y(p_0) = \sum_{k=1}^{K} w_k \cdot x(p_0 + p_k) \tag{3}$$

In the equation, $y(p_0)$ represents the sampled output of the convolution kernel, $w_k$ represents the projection weight of the $k$-th sampling point of the convolution kernel, $p_0$ is the center position of the convolution kernel, and $x(p_0 + p_k)$ represents the pixel value of the input feature map at the corresponding sampling point location.
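The following is a small numerical check of Equation (3) in PyTorch, purely for illustration: the output at one center position is computed as the weighted sum over the fixed grid and compared against conv2d.

```python
import torch

# Worked illustration of Eq. (3): a 3x3 convolution output at one position p0
# is the weighted sum of input values sampled on the fixed grid D.
x = torch.randn(7, 7)                      # single-channel input feature map
w = torch.randn(9)                         # kernel weights w_k, k = 1..9
D = [(dy, dx) for dy in (-1, 0, 1) for dx in (-1, 0, 1)]  # fixed sampling grid

p0 = (3, 3)                                # center position
y_p0 = sum(w[k] * x[p0[0] + dy, p0[1] + dx] for k, (dy, dx) in enumerate(D))

# The same value via PyTorch's conv2d (no padding), confirming the formula.
ref = torch.nn.functional.conv2d(x.view(1, 1, 7, 7), w.view(1, 1, 3, 3))
assert torch.allclose(y_p0, ref[0, 0, p0[0] - 1, p0[1] - 1], atol=1e-5)
```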
DCNv3 is depicted in Figure 5. Assuming that $K = k_h \times k_w$ denotes the number of sampling points of the kernel, the implementation of DCNv3 can be divided into two main steps. The first step involves conducting depth-wise and point-wise inferences on the input feature map $x$, resulting in the generation of position-offset maps $\Delta p \in \mathbb{R}^{H \times W \times 2GK}$ and modulation maps $M \in \mathbb{R}^{H \times W \times GK}$, where $G$ represents the number of groups. In the second step, $\Delta p$ and $M$ play flexible roles in the feature extraction procedure by focusing on the sampling object and modulating the weights, respectively. More detailed explanations with formulas are given in the following.
In the first step, the depth-wise direction inference, as indicated by Equation (4), applies a channel-wise operation to the input $x$, resulting in $x'$:

$$x' = \mathrm{GELU}\big(\mathrm{BN}\big(\mathrm{Permute}(\mathrm{DConv}(x))\big)\big) \tag{4}$$

where $\mathrm{DConv}(\cdot)$ is the depth-wise convolution and $\mathrm{Permute}(\cdot)$ converts the channel dimension to the last dimension. BN and GELU denote the normalization and activation function, respectively.
For the point-wise direction inference, as illustrated by Equations (5) and (6), a fully connected approach is employed to share projection weights among the sampling points, ultimately computing the bias offsets and modulation factors:

$$\Delta p = \mathrm{Linear}(x') \tag{5}$$

$$M = \mathrm{Softmax}\big(\mathrm{Linear}(x')\big) \tag{6}$$

where $\mathrm{Linear}(\cdot)$ denotes the shared linear layer applied to each sampling point. Note that $M$ passes through a softmax function and is therefore stable in the $K$ dimension.
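Below is a sketch of this first step (Equations (4)–(6)) in PyTorch: a depth-wise convolution followed by normalization, activation, and two shared linear heads that emit $\Delta p$ with $2GK$ channels and $M$ with $GK$ channels. The class name and exact hyperparameters are assumptions; only the layer types and shapes follow the text.

```python
import torch
import torch.nn as nn

class OffsetMaskHead(nn.Module):
    """Generates position-offset maps (Eq. (5)) and modulation maps (Eq. (6))
    from the depth-wise inference of Eq. (4)."""
    def __init__(self, channels: int, groups: int = 4, kernel: int = 3):
        super().__init__()
        K = kernel * kernel
        self.dconv = nn.Conv2d(channels, channels, kernel,
                               padding=kernel // 2, groups=channels)  # depth-wise
        self.bn = nn.BatchNorm2d(channels)
        self.act = nn.GELU()
        self.offset = nn.Linear(channels, 2 * groups * K)  # shared per point
        self.mask = nn.Linear(channels, groups * K)
        self.groups, self.K = groups, K

    def forward(self, x):                      # x: (B, C, H, W)
        h = self.act(self.bn(self.dconv(x)))   # Eq. (4)
        h = h.permute(0, 2, 3, 1)              # Permute: channels to last dim
        delta_p = self.offset(h)               # (B, H, W, 2GK), Eq. (5)
        m = self.mask(h)                       # (B, H, W, GK),  Eq. (6)
        B, H, W, _ = m.shape
        m = m.view(B, H, W, self.groups, self.K).softmax(-1)  # stable in K dim
        return delta_p, m

# Example: G = 4 groups, 3x3 kernel (K = 9), so 2GK = 72 offset channels.
head = OffsetMaskHead(channels=64, groups=4, kernel=3)
dp, m = head(torch.randn(1, 64, 20, 20))
print(dp.shape, m.shape)  # (1, 20, 20, 72) and (1, 20, 20, 4, 9)
```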
In the second step, grouped convolution operations are applied to the input $x$; during the convolution operation for each group, the sampling points of the conventional convolution kernel are adjusted using the position-offsets and modulation factors. This adjustment aims to better focus the sampling points on the target features. The output for any pixel $p_0$ of the input feature map is expressed in Equation (7):

$$y(p_0) = \sum_{g=1}^{G} \sum_{k=1}^{K} w_g \, m_{gk} \, x_g(p_0 + p_k + \Delta p_{gk}) \tag{7}$$

DCNv3 not only exhibits enhanced geometric transformation capabilities but also, through the application of grouped convolution, segregates the spatial aggregation process into $G$ groups. This effectively controls the parameter count and introduces diverse spatial aggregation patterns into the sampling process, thereby providing stronger feature information for downstream tasks.

In the equation, $G$ represents the number of groups, $g$ indexes the groups, $w_g$ represents the projection weight for the sampling positions of the convolution kernel used in the $g$-th group, $m_{gk}$ represents the modulation factor of the $k$-th sampling point in the $g$-th group, and $\Delta p_{gk}$ represents the position-offset of the $k$-th sampling point in the $g$-th group.
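To make Equation (7) concrete, here is a deliberately naive, single-position evaluation in PyTorch; a real implementation vectorizes this with grid_sample or a custom kernel. The per-channel form of $w_g$ used here is a simplification of the group-shared projection, and all names are illustrative.

```python
import math
import torch

def bilinear_sample(x: torch.Tensor, py: float, px: float) -> torch.Tensor:
    """Bilinearly sample a (C, H, W) map at fractional location (py, px)."""
    C, H, W = x.shape
    y0, x0 = math.floor(py), math.floor(px)
    wy, wx = py - y0, px - x0
    out = torch.zeros(C)
    for dy, dx, w in [(0, 0, (1 - wy) * (1 - wx)), (0, 1, (1 - wy) * wx),
                      (1, 0, wy * (1 - wx)), (1, 1, wy * wx)]:
        yy, xx = y0 + dy, x0 + dx
        if 0 <= yy < H and 0 <= xx < W:   # out-of-bounds samples contribute zero
            out += w * x[:, yy, xx]
    return out

def dcnv3_point(x, w, m, delta_p, p0, grid):
    """Eq. (7) at one position p0. x: (C, H, W); w: (G, C_g) weights (simplified
    per-channel w_g); m: (G, K) modulation; delta_p: (G, K, 2) offsets."""
    G, K = m.shape
    xg = x.chunk(G, dim=0)                # channel-wise groups x_g
    outputs = []
    for g in range(G):
        acc = torch.zeros(xg[g].shape[0])
        for k, (dy, dx) in enumerate(grid):
            py = p0[0] + dy + delta_p[g, k, 0].item()   # p0 + p_k + delta_p_gk
            px = p0[1] + dx + delta_p[g, k, 1].item()
            acc += m[g, k] * bilinear_sample(xg[g], py, px)
        outputs.append(w[g] * acc)        # projection weight w_g
    return torch.cat(outputs)

grid = [(dy, dx) for dy in (-1, 0, 1) for dx in (-1, 0, 1)]   # 3x3 kernel, K = 9
x = torch.randn(8, 7, 7)                                      # C = 8, G = 4 groups
y = dcnv3_point(x, torch.ones(4, 2), torch.full((4, 9), 1 / 9),
                0.5 * torch.randn(4, 9, 2), (3, 3), grid)
print(y.shape)  # torch.Size([8])
```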
The structure of DCNv3C3 is illustrated in Figure 6, where DCNv3 is introduced into the C3 module. In the original C3 module, the Bottleneck part consists of two CBS structures and one residual connection. In the improved version, one of the CBS structures is optimized into a DBS structure, which employs the more versatile DCNv3 in place of normal convolution for feature sampling. The DCNv3C3 module requires setting the convolution group number of the DCNv3 operator according to the actual situation; in this paper, the group number was set according to the number of input feature map channels.
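The following is a structural sketch of the described Bottleneck change. For self-containedness it uses torchvision's DeformConv2d as a stand-in for the DCNv3 operator (the real module would use the DCNv3 op, with its softmax modulation and group-shared weights); the SiLU activation and channel handling follow YOLOv5 convention and are assumptions.

```python
import torch
import torch.nn as nn
from torchvision.ops import DeformConv2d

class DBS(nn.Module):
    """Deformable conv + BN + SiLU: the DBS block that replaces one CBS.
    DeformConv2d stands in for DCNv3 in this sketch."""
    def __init__(self, c_in, c_out, k=3):
        super().__init__()
        self.offset = nn.Conv2d(c_in, 2 * k * k, k, padding=k // 2)
        nn.init.zeros_(self.offset.weight)   # start as a regular convolution
        nn.init.zeros_(self.offset.bias)
        self.dconv = DeformConv2d(c_in, c_out, k, padding=k // 2, bias=False)
        self.bn = nn.BatchNorm2d(c_out)
        self.act = nn.SiLU(inplace=True)

    def forward(self, x):
        return self.act(self.bn(self.dconv(x, self.offset(x))))

class DCNv3Bottleneck(nn.Module):
    """Bottleneck with one CBS swapped for DBS, plus the residual connection."""
    def __init__(self, c, shortcut=True):
        super().__init__()
        self.cbs = nn.Sequential(nn.Conv2d(c, c, 1, bias=False),
                                 nn.BatchNorm2d(c), nn.SiLU(inplace=True))
        self.dbs = DBS(c, c, k=3)
        self.add = shortcut

    def forward(self, x):
        y = self.dbs(self.cbs(x))
        return x + y if self.add else y

# Example: the block preserves shape, as in the original Bottleneck.
block = DCNv3Bottleneck(128)
print(block(torch.randn(1, 128, 40, 40)).shape)  # torch.Size([1, 128, 40, 40])
```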
2.4. Focal-CIoU Loss Function
In the YOLOv5 algorithm, the default choice for bounding box regression loss is the CIoU loss function, as shown in Equations (8)–(10). This loss function takes into account three factors: the overlap area between the predicted and true bounding boxes, the center point distance, and the aspect ratio. However, it overlooks the issue of imbalance between low-quality and high-quality samples. Low-quality samples refer to predicted boxes with minimal overlap with the target box; these anchor boxes result in larger regression errors and have a negative impact on training.

$$L_{CIoU} = 1 - IoU + \frac{\rho^2(b, b^{gt})}{c^2} + \alpha v \tag{8}$$

$$v = \frac{4}{\pi^2}\left(\arctan\frac{w^{gt}}{h^{gt}} - \arctan\frac{w}{h}\right)^2 \tag{9}$$

$$\alpha = \frac{v}{(1 - IoU) + v} \tag{10}$$

In Equations (8)–(10), $\rho(b, b^{gt})$ represents the Euclidean distance between the center points of the predicted box and the true box, $c$ represents the diagonal length of the minimum closed bounding region of the two bounding boxes, $w$ and $h$ are the width and height of the predicted box, respectively, and $w^{gt}$ and $h^{gt}$ are the width and height of the true box, respectively.
To enhance the contribution of high-quality samples during training, this study drew inspiration from the Focal-EIoU loss function [31] and combined the ideas of the CIoU loss function and the Focal loss function, proposing the Focal-CIoU loss function. The Focal-CIoU loss function assigns weights to training samples based on the IoU and the $\gamma$ parameter, giving higher weights to high-quality samples. This reduces the contribution of low-quality samples to bounding box regression, allowing the algorithm to focus on high-quality samples during training, which helps improve regression accuracy. The Focal-CIoU loss function is defined in Equation (11):

$$L_{Focal\text{-}CIoU} = IoU^{\gamma} \cdot L_{CIoU} \tag{11}$$

where $\gamma$ is a parameter used to control the degree of suppression of outliers. In this study, the parameter setting method from Focal-EIoU was adopted, with $\gamma$ taking a value of 0.5.
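For reference, here is a self-contained PyTorch sketch of the CIoU terms (Equations (8)–(10)) and the Focal-CIoU weighting of Equation (11); the (x1, y1, x2, y2) box format and the mean reduction are our choices.

```python
import math
import torch

def ciou_terms(pred: torch.Tensor, target: torch.Tensor, eps: float = 1e-7):
    """IoU and the CIoU loss of Eqs. (8)-(10) for boxes in (x1, y1, x2, y2) format."""
    # Overlap area -> IoU
    iw = (torch.min(pred[..., 2], target[..., 2]) - torch.max(pred[..., 0], target[..., 0])).clamp(0)
    ih = (torch.min(pred[..., 3], target[..., 3]) - torch.max(pred[..., 1], target[..., 1])).clamp(0)
    inter = iw * ih
    w1, h1 = pred[..., 2] - pred[..., 0], pred[..., 3] - pred[..., 1]
    w2, h2 = target[..., 2] - target[..., 0], target[..., 3] - target[..., 1]
    iou = inter / (w1 * h1 + w2 * h2 - inter + eps)
    # rho^2 / c^2: squared center distance over squared enclosing-box diagonal
    cw = torch.max(pred[..., 2], target[..., 2]) - torch.min(pred[..., 0], target[..., 0])
    ch = torch.max(pred[..., 3], target[..., 3]) - torch.min(pred[..., 1], target[..., 1])
    c2 = cw ** 2 + ch ** 2 + eps
    rho2 = ((pred[..., 0] + pred[..., 2] - target[..., 0] - target[..., 2]) ** 2
            + (pred[..., 1] + pred[..., 3] - target[..., 1] - target[..., 3]) ** 2) / 4
    # Aspect-ratio consistency, Eqs. (9) and (10)
    v = (4 / math.pi ** 2) * (torch.atan(w2 / (h2 + eps)) - torch.atan(w1 / (h1 + eps))) ** 2
    alpha = v / ((1 - iou) + v + eps)
    return iou, 1 - iou + rho2 / c2 + alpha * v          # Eq. (8)

def focal_ciou_loss(pred: torch.Tensor, target: torch.Tensor, gamma: float = 0.5) -> torch.Tensor:
    """Focal-CIoU of Eq. (11): IoU^gamma down-weights low-overlap (low-quality) boxes."""
    iou, l_ciou = ciou_terms(pred, target)
    return (iou.clamp(min=1e-7) ** gamma * l_ciou).mean()

# Example: one predicted box vs. one ground-truth box.
pred = torch.tensor([[10.0, 10.0, 50.0, 60.0]])
gt = torch.tensor([[12.0, 14.0, 48.0, 58.0]])
print(focal_ciou_loss(pred, gt, gamma=0.5))
```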