Article

A Lightweight CER-YOLOv5s Algorithm for Detection of Construction Vehicles at Power Transmission Lines

1 College of Information Science and Engineering, Hebei University of Science and Technology, Shijiazhuang 050018, China
2 College of Electrical Engineering, Hebei University of Technology, Tianjin 300401, China
* Author to whom correspondence should be addressed.
Appl. Sci. 2024, 14(15), 6662; https://doi.org/10.3390/app14156662
Submission received: 4 July 2024 / Revised: 23 July 2024 / Accepted: 27 July 2024 / Published: 30 July 2024

Abstract

In power-line scenarios characterized by complex backgrounds and by targets of diverse scales and shapes, engineering-vehicle detection suffers from large model parameter counts, insufficient feature extraction, and a tendency to miss small targets. To address these issues, a lightweight detection algorithm, CER-YOLOv5s, is proposed. First, the C3 module was restructured by embedding a lightweight Ghost bottleneck structure and a convolutional attention module, enhancing the model's ability to extract key features while reducing computational cost. Second, an E-BiFPN feature pyramid network is proposed that uses channel attention to suppress background noise and strengthen the model's focus on important regions, while bidirectional connections optimize the feature fusion paths and improve the efficiency of multi-scale feature fusion. In addition, an enhanced receptive module (ERM) was added to the feature fusion stage to expand the receptive field of shallow feature maps through repeated convolutions, improving the perception of global information about small targets. Finally, a Soft-DIoU-NMS suppression algorithm is proposed to improve the candidate-box selection mechanism and address the poor detection of occluded targets. The experimental results show that, compared with the baseline YOLOv5s, the improved algorithm reduces parameters and computation by 27.8% and 31.9%, respectively, while the mean average precision (mAP) increases by 2.9%, reaching 98.3%. This performance surpasses recent mainstream algorithms and indicates stronger robustness across scenarios, and the algorithm meets the lightweight requirements of embedded devices in power-line settings.

1. Introduction

As the scale of power grid construction continues to expand, incidents such as engineering vehicles touching power lines, accidentally hitting tower poles, and illegal excavation of foundations frequently occur, posing potential threats to the safe operation of transmission lines. The “Overhead Transmission Line Operation Management Specification” indicates that regular inspections are essential for ensuring the stable and reliable operation of the power grid system. The use of drones for inspections is safe, reliable, cost-effective, and highly efficient, and has become a significant method for inspecting transmission lines, replacing manual inspections [1]. However, the vast amount of images generated during inspections not only imposes a tremendous workload on staff but also makes visual identification of inspection targets inefficient [2]. Therefore, the intelligent detection of engineering vehicles is of great significance for transmission-line inspection tasks and for promoting the development of digital power grids.
Object detection algorithms are an important technology in the field of computer vision, used to identify specific targets in images or videos. With the continuous development of deep learning theories and image processing devices, there has been extensive research on detection algorithms in the context of transmission-line scenarios. Li et al. [3] proposed a defect detection algorithm for transmission-line bolts based on Faster R-CNN, utilizing SCNet-101 (sample consensus network-101) for feature extraction and improving the FPN (feature pyramid network) structure to enhance detection accuracy, though this also increased the model’s parameter count. Liu et al. [4], addressing the issue of low accuracy in insulator image defect detection algorithms in complex environments, re-clustered anchor box sizes, improved the BCE (binary cross-entropy loss) function, and increased the depth of convolutional layers in the SPP (spatial pyramid pooling) module, effectively enhancing algorithm accuracy. Li et al. [5] made improvements based on the YOLOv5 algorithm by introducing the SE_ECSP module to enhance feature weighting, adding a small target detection layer and improving the loss function to further enhance the accuracy for small targets. However, the model still experienced some missed detections. Yan et al. [6], to address the issues of inconspicuous features and low detection accuracy for small targets, conducted cross-level channel feature fusion on feature maps of different scales and used the CA (coordinate attention) module to aggregate positional information, effectively improving the YOLO algorithm’s small-target detection capability. Hao et al. [7] utilized a transformer module on top of YOLOv5 to capture global information of the targets, achieving an mAP of 95.6%. However, small defect targets in shaded areas were still missed.
Although existing research has achieved significant results in object detection, challenges remain in complex environments. On one hand, existing models typically improve detection accuracy by deepening the network, which increases model complexity. On the other hand, because small targets have low resolution and carry little feature information, the main difficulty in multi-scale target detection lies in precisely localizing them. To address these issues, this paper improves the model structure and the non-maximum suppression (NMS) algorithm based on YOLOv5s, constructing a multi-scale target detection model. The main contributions include the following:
  • To address the issue of large parameter size and high memory usage in the C3 module, a CGC3 module was constructed, including a lightweight bottleneck structure and CBAM attention. This reduces model complexity while improving detection accuracy through dimension interaction.
  • An E-BiFPN (efficient bidirectional feature pyramid network) is proposed to facilitate hierarchical information flow and optimize multi-scale feature fusion capability.
  • Drawing inspiration from the human visual perception system, an enhanced receptive module (ERM) was integrated into the feature fusion stage to capture target context information and enhance small-target detection capability.
  • A Soft-DIoU-NMS post-processing algorithm is proposed, incorporating distance-related weighting factors to address candidate box redundancy issues, thus reducing the missed detection rate for occluded targets.

2. Materials and Methods

2.1. Dataset Construction

Currently, there is no unified public dataset for transmission-line engineering vehicles. Therefore, in this study, the experimental data were autonomously collected by drones, resulting in 718 images of engineering vehicles captured at different times and under different lighting conditions. The images had a resolution of 1920 × 1080, and the data categories included trucks, excavators, cranes, and loaders. Vehicle targets were annotated using the LabelImg tool v.1.4.0 and saved in Pascal VOC format as XML files, which were then converted to the YOLO format required for training.
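For reproducibility, the VOC-to-YOLO label conversion step can be scripted. The following minimal sketch is an illustration only; the class order and file paths are assumptions, not taken from the original annotation toolchain:

```python
import xml.etree.ElementTree as ET

# Class order assumed for illustration; the actual mapping is defined by the dataset.
CLASSES = ["truck", "excavator", "crane", "loader"]

def voc_to_yolo(xml_path, txt_path):
    """Convert one Pascal VOC XML annotation file into a YOLO-format label file."""
    root = ET.parse(xml_path).getroot()
    w = float(root.find("size/width").text)
    h = float(root.find("size/height").text)
    lines = []
    for obj in root.iter("object"):
        cls_id = CLASSES.index(obj.find("name").text)
        box = obj.find("bndbox")
        xmin, ymin = float(box.find("xmin").text), float(box.find("ymin").text)
        xmax, ymax = float(box.find("xmax").text), float(box.find("ymax").text)
        # YOLO format: class x_center y_center width height, all normalized to [0, 1].
        xc, yc = (xmin + xmax) / 2 / w, (ymin + ymax) / 2 / h
        bw, bh = (xmax - xmin) / w, (ymax - ymin) / h
        lines.append(f"{cls_id} {xc:.6f} {yc:.6f} {bw:.6f} {bh:.6f}")
    with open(txt_path, "w") as f:
        f.write("\n".join(lines))
```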
To enrich data diversity and avoid overfitting during the training process, data augmentation methods as shown in Figure 1 were applied to expand the samples. As a result, 2154 images were obtained in total. The dataset was divided into training, validation, and test sets in an 8:1:1 ratio, resulting in 1724 training images, 215 validation images, and 215 test images. Figure 2 provides a detailed overview of the dataset. It can be observed that the data categories were imbalanced, with fewer large targets and more small targets in the images. This distribution is consistent with the main characteristics of drone images and further enhances the research significance of vehicle detection in transmission-line scenarios.
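The augmentation operations shown in Figure 1 can be implemented with standard image libraries. The sketch below, using OpenCV and NumPy with illustrative parameter values (the actual augmentation settings are not reported here), shows one possible form of the brightness, Gaussian-noise, and rotation transforms:

```python
import cv2
import numpy as np

def adjust_brightness(img, factor=1.3):
    """Scale pixel intensities to simulate a brightness change."""
    return np.clip(img.astype(np.float32) * factor, 0, 255).astype(np.uint8)

def add_gaussian_noise(img, sigma=15):
    """Add zero-mean Gaussian noise."""
    noise = np.random.normal(0, sigma, img.shape)
    return np.clip(img.astype(np.float32) + noise, 0, 255).astype(np.uint8)

def rotate(img, angle=10):
    """Rotate about the image center (bounding boxes must be transformed accordingly)."""
    h, w = img.shape[:2]
    m = cv2.getRotationMatrix2D((w / 2, h / 2), angle, 1.0)
    return cv2.warpAffine(img, m, (w, h))
```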

2.2. Structure of the CER-YOLOv5s Model

Based on the YOLOv5s 6.1 framework, this paper proposes an improved lightweight engineering-vehicle detection model, CER-YOLOv5s. As shown in Figure 3, its structure consists of three parts: the backbone, the neck, and the head [8].
In the backbone network, the CBS module employs convolution, batch normalization, and activation function operations for feature extraction, enhancing the model’s nonlinear fitting ability. The CGC3 module introduces a cross-stage merging strategy, reducing the computational complexity while increasing network depth. The SPPF module, building on the SPP module, removes redundant gradient information and enhances deep feature representation through max-pooling and tensor concatenation. In the neck network, an E-BiFPN feature fusion strategy is included, which adds a path from the 6th layer to the 20th layer. This strategy utilizes upsampling and skip connection structures to achieve bidirectional flow and multi-scale fusion of features, thereby improving target perception. Additionally, the ERM module, which incorporates convolutional kernels of different sizes and dilated convolutions, has been added to expand the receptive field of shallow layers, enriching the contextual information of the target. The head network consists of three prediction layers of different scales, responsible for outputting target class, confidence, and location information. The CIoU bounding box loss function is used to improve the accuracy of target regression. In the post-processing stage, an improved non-maximum suppression algorithm is used to filter out redundant prediction boxes, resulting in the final prediction output.

2.2.1. CGC3 Module

As deep learning gradually advances towards edge computing, lightweight model structures have become a research focus in recent years. To meet the deployment requirements of object detection models on embedded hardware devices, this study utilized the Ghost [9] module and CBAM attention mechanism to reconstruct the C3 module, reducing the computational cost of standard convolutional layers. Unlike conventional convolution methods, the Ghost module enhances feature extraction from the perspective of feature redundancy while maintaining the same output as standard convolution. This approach improves the comprehensive understanding of input images. As shown in Figure 4, for a given input feature map, the Ghost module divides the convolution operation into two parts. The first part generates partial intrinsic feature maps using conventional convolution to prevent excessive computational load. The second part employs group convolution operations, utilizing simple linear transformations to individually perform 3 × 3 or 5 × 5 convolutions on the feature maps obtained from the first part. Finally, these two parts are concatenated along the channel dimension through identity mapping to produce the final output. When outputting n feature maps, the parameter count of a regular convolutional layer is $p_1$, the parameter count of the Ghost module is $p_2$, and the ratio of their parameter counts is:
$$\frac{p_1}{p_2} = \frac{n \cdot c \cdot k \cdot k}{\frac{n}{s} \cdot c \cdot k \cdot k + (s-1) \cdot \frac{n}{s} \cdot d \cdot d} \approx \frac{s \cdot c}{s + c - 1} \approx s \tag{1}$$
where c represents the number of channels; k represents the size of the convolution kernel; d represents the size of the linear operation convolution kernel; s represents the number of linear operations. When k and d are the same size, the parameter count of a regular convolutional layer is s times that of the Ghost module, and similarly, the computational complexity of a regular convolutional layer is s times that of the Ghost module.
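To make the Ghost operation concrete, a minimal PyTorch sketch is given below; the ratio s, kernel sizes, and activation are illustrative choices rather than the exact CER-YOLOv5s configuration:

```python
import torch
import torch.nn as nn

class GhostModule(nn.Module):
    """Generate part of the output channels with a cheap depthwise (linear) operation."""
    def __init__(self, c_in, c_out, k=1, d=3, s=2):
        super().__init__()
        c_primary = c_out // s                       # intrinsic feature maps from ordinary convolution
        c_cheap = c_out - c_primary                  # "ghost" feature maps from cheap linear operations
        self.primary = nn.Sequential(
            nn.Conv2d(c_in, c_primary, k, 1, k // 2, bias=False),
            nn.BatchNorm2d(c_primary), nn.SiLU())
        self.cheap = nn.Sequential(
            nn.Conv2d(c_primary, c_cheap, d, 1, d // 2, groups=c_primary, bias=False),
            nn.BatchNorm2d(c_cheap), nn.SiLU())

    def forward(self, x):
        y = self.primary(x)
        return torch.cat([y, self.cheap(y)], dim=1)  # concatenate along the channel dimension
```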
Similar to the residual structure, the Ghost bottleneck structure primarily consists of two cascaded Ghost modules. The first Ghost module expands the number of feature map channels to extract deeper information. The second Ghost module adjusts the channel count to match the residual path. Between the two Ghost modules, depthwise separable convolution with a stride of 2 is used for downsampling. Although the Ghost module significantly reduces the model’s computational cost, it struggles to capture spatial features effectively, inevitably leading to some accuracy loss. To address this issue, this paper integrates a lightweight CBAM [10] attention module into the Ghost bottleneck structure. By learning weights to selectively focus on specific areas of the input image, this integration enhances the model’s detection accuracy. The CBAM–Ghost structure is illustrated in Figure 5.
Given an intermediate feature as input, CBAM performs adaptive feature refinement along both channel and spatial dimensions. Specifically, it first obtains the channel attention weights and multiplies them elementwise with the input features. Next, the spatial attention module generates spatial attention weights. Finally, these weights are multiplied elementwise with the refined feature map to produce the final output. Figure 6 shows a comparison of the output heatmaps before and after adding the CBAM module. The red areas indicate the regions with the highest saliency, which contribute the most to the prediction results. It can be observed that the heatmap after adding the CBAM module pays more attention to key target information, addressing the issue of the original model lacking attention preferences, and thus laying a foundation for accurate detection of engineering vehicles.
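A compact PyTorch sketch of the CBAM operation described above is shown below; the reduction ratio and spatial kernel size are the commonly used defaults and are assumptions here:

```python
import torch
import torch.nn as nn

class CBAM(nn.Module):
    """Channel attention followed by spatial attention, applied sequentially to the input."""
    def __init__(self, c, reduction=16, k=7):
        super().__init__()
        self.mlp = nn.Sequential(                     # shared MLP for channel attention
            nn.Conv2d(c, c // reduction, 1, bias=False), nn.ReLU(),
            nn.Conv2d(c // reduction, c, 1, bias=False))
        self.spatial = nn.Conv2d(2, 1, k, padding=k // 2, bias=False)

    def forward(self, x):
        # Channel attention: aggregate spatial information by average- and max-pooling.
        avg = self.mlp(torch.mean(x, dim=(2, 3), keepdim=True))
        mx = self.mlp(torch.amax(x, dim=(2, 3), keepdim=True))
        x = x * torch.sigmoid(avg + mx)
        # Spatial attention: aggregate channel information, then learn a 2-D weight map.
        s = torch.cat([x.mean(dim=1, keepdim=True), x.amax(dim=1, keepdim=True)], dim=1)
        return x * torch.sigmoid(self.spatial(s))
```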
As shown in Figure 7, replacing the bottleneck structure in the original C3 module with the CBAM–Ghost bottleneck structure resulted in a new C3 module, named the CGC3 module. Compared with the original structure, this module reduces gradient redundancy through a cross-stage fusion strategy and leverages the advantages of hybrid attention in extracting low-level visual features. By dynamically weighting important areas of the image, the CGC3 module enhances the model’s ability to extract vehicle features, which is beneficial for the practical application of detection algorithms.

2.2.2. E-BiFPN Feature Pyramid Network

The YOLOv5s model uses a traditional FPN [11] + PAN [12] feature pyramid structure to handle multi-scale object detection. This structure employs a pair of top-down and bottom-up pathways to merge information at different levels. Although this simple addition approach improves the model’s feature fusion capability to some extent, it overlooks the importance differences of feature maps with different resolutions during the fusion process, resulting in limited utilization of multi-scale features and restricting detection accuracy. To address this issue, this study combined the ECA [13] module with the BiFPN [14] feature fusion network to develop the proposed E-BiFPN (efficient bidirectional feature pyramid network). Its structure is shown in Figure 8.
Firstly, the intermediate nodes with single input and no feature fusion were removed, and skip connections were added between the same levels to retain information from the original input feature layers. The weighted fusion mechanism combines channel attention and fast normalization to adjust the contribution of feature maps at different scales. Equation (2) shows the method of fast normalization for feature fusion.
$$O = \sum_{i} \frac{w_i}{\varepsilon + \sum_{j} w_j} \cdot I_i \tag{2}$$
where $I_i$ represents the input feature; $O$ represents the output feature; $w_i$ and $w_j$ are learnable weights; $\varepsilon = 0.0001$ is used to ensure numerical stability. The ECA module, built upon the weighted fusion, employs a local cross-channel interaction strategy without dimensionality reduction. It compresses spatial information through global average pooling and adaptively selects one-dimensional convolution kernel sizes to learn crucial information from different channels, further optimizing the feature fusion process. Its structure is depicted in Figure 9.
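For clarity, the ECA operation can be sketched as follows in PyTorch; the adaptive kernel-size rule follows the ECA-Net formulation and is an assumption here rather than a detail reported in this paper:

```python
import math
import torch
import torch.nn as nn

class ECA(nn.Module):
    """Efficient channel attention via local cross-channel interaction, without dimensionality reduction."""
    def __init__(self, channels, gamma=2, b=1):
        super().__init__()
        # Adaptive kernel size: an odd value derived from the channel count.
        t = int(abs((math.log2(channels) + b) / gamma))
        k = t if t % 2 else t + 1
        self.conv = nn.Conv1d(1, 1, kernel_size=k, padding=k // 2, bias=False)

    def forward(self, x):
        w = x.mean(dim=(2, 3))                    # global average pooling -> (N, C)
        w = self.conv(w.unsqueeze(1)).squeeze(1)  # 1-D convolution across channels
        return x * torch.sigmoid(w)[:, :, None, None]
```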
Considering network complexity, the E-BiFPN structure proposed in this paper includes only one skip connection. Taking the $P_4$ layer as an example, the feature fusion process is illustrated by Equations (3) and (4):
$$P_4^{td} = Conv\left(\frac{w_1 \cdot P_4^{in} + w_2 \cdot Resize\left(P_5^{in}\right)}{w_1 + w_2 + \varepsilon}\right) \tag{3}$$
$$P_4^{out} = Conv\left(\frac{w_1' \cdot P_4^{in} + w_2' \cdot P_4^{td} + w_3' \cdot Resize\left(P_3^{out}\right)}{w_1' + w_2' + w_3' + \varepsilon}\right) \tag{4}$$
where $P_4^{in}$ represents the input feature of the 4th layer; $P_4^{td}$ is the intermediate top-down feature; $P_4^{out}$ represents the output feature of the 4th layer; $Conv$ denotes the convolution operation; $Resize$ refers to an upsampling or downsampling operation; $w_i$ and $w_i'$ are the learnable fusion weights. Compared with the original structure, the E-BiFPN improves the feature extraction capability for multi-scale engineering-vehicle targets by aggregating image features multiple times, without significantly increasing computational overhead.
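A minimal sketch of the fast normalized fusion in Equations (2)-(4), written as a reusable PyTorch module, is given below; the ReLU clamp on the weights is assumed, following the BiFPN formulation:

```python
import torch
import torch.nn as nn

class WeightedFusion(nn.Module):
    """Fast normalized fusion: O = sum_i ( w_i / (eps + sum_j w_j) ) * I_i."""
    def __init__(self, n_inputs, eps=1e-4):
        super().__init__()
        self.w = nn.Parameter(torch.ones(n_inputs))
        self.eps = eps

    def forward(self, inputs):
        w = torch.relu(self.w)              # keep the learnable weights non-negative
        w = w / (self.eps + w.sum())
        return sum(wi * x for wi, x in zip(w, inputs))

# Usage sketch for Equation (3): P4_td = conv(fuse([P4_in, resize(P5_in)])),
# where conv and resize are the corresponding convolution and resampling operations.
```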

2.2.3. ERM Module

Small objects occupy fewer pixels in feature maps and have less distinct features, making them easy to overlook or confuse with background noise. This leads to difficulties in feature extraction and increased detection difficulty, resulting in missed detections. Shallow feature maps, which are rich in positional information, are advantageous for detecting small objects. However, shallow feature maps have larger sizes and relatively smaller receptive fields, thus lacking sufficient global information, resulting in poor small-object detection performance. Therefore, to improve the accuracy of small-object detection, this study introduced the ERM in the 18th layer. By simulating the human visual mechanism, this module enhances the network’s feature extraction ability, allowing it to better capture the global context information of shallow feature maps. This improves the distinguishability of small objects and further enhances detection accuracy.
As shown in Figure 10, the ERM module consists of two parts: multi-branch dilated convolutions and residual connections. The multi-branch dilated convolution section includes four parallel conventional convolutions and dilated convolutions with different dilation rates. Inspired by TridentNet [15], the ERM module first uses a 1 × 1 convolution in the first layer of each branch to perform channel dimensionality reduction on the input feature layer. Next, two parallel 1 × 3 convolutions and 3 × 1 convolutions are used to replace the 3 × 3 convolution, reducing the computational load of the model while enhancing width features. Similarly, two cascaded 1 × 3 convolutions and 3 × 1 convolutions are used to replace the 5 × 5 convolution, enhancing height features. Finally, 3 × 3 convolutions with dilation rates of 1, 3, 3, and 5 are used to capture feature information over a larger receptive field. Additionally, residual connections are added to retain the original input information, further enhancing the model’s nonlinear representation capability, ensuring that the detailed features of small objects are fully expressed.
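The exact branch layout of the ERM is defined by Figure 10; the following PyTorch sketch only illustrates the general pattern of 1 × 1 reduction, asymmetric 1 × 3 / 3 × 1 convolutions, dilated 3 × 3 convolutions with rates (1, 3, 3, 5), and a residual connection. Channel counts and the precise branch composition are illustrative assumptions:

```python
import torch
import torch.nn as nn

def conv_bn(c_in, c_out, k, d=1):
    """Convolution + BN + SiLU; padding keeps the spatial size unchanged."""
    if isinstance(k, int):
        pad = d * (k - 1) // 2
    else:
        pad = (d * (k[0] - 1) // 2, d * (k[1] - 1) // 2)
    return nn.Sequential(nn.Conv2d(c_in, c_out, k, padding=pad, dilation=d, bias=False),
                         nn.BatchNorm2d(c_out), nn.SiLU())

class ERM(nn.Module):
    """Multi-branch dilated convolutions with a residual connection to enlarge the receptive field."""
    def __init__(self, c, rates=(1, 3, 3, 5)):
        super().__init__()
        cb = c // 4  # per-branch channels after the 1x1 reduction
        self.b1 = nn.Sequential(conv_bn(c, cb, 1), conv_bn(cb, cb, 3, d=rates[0]))
        self.b2 = nn.Sequential(conv_bn(c, cb, 1), conv_bn(cb, cb, (1, 3)), conv_bn(cb, cb, 3, d=rates[1]))
        self.b3 = nn.Sequential(conv_bn(c, cb, 1), conv_bn(cb, cb, (3, 1)), conv_bn(cb, cb, 3, d=rates[2]))
        self.b4 = nn.Sequential(conv_bn(c, cb, 1), conv_bn(cb, cb, (1, 3)), conv_bn(cb, cb, (3, 1)),
                                conv_bn(cb, cb, 3, d=rates[3]))
        self.out = conv_bn(4 * cb, c, 1)

    def forward(self, x):
        y = torch.cat([self.b1(x), self.b2(x), self.b3(x), self.b4(x)], dim=1)
        return self.out(y) + x   # residual connection keeps the original input information
```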

2.2.4. Soft-DIoU-NMS Algorithm

In images of power transmission lines, there is a phenomenon where engineering-vehicle targets occlude each other. The YOLOv5s model uses the traditional NMS algorithm to eliminate overlapping detection boxes, but this method has the limitation of mistakenly deleting boundary boxes. The Soft-NMS algorithm [16] suppresses the competition between overlapping boxes by gradually decreasing the confidence scores of the target boxes, effectively reducing the problem of missed detections. However, it still uses IoU (intersection over union) to determine the overlap between two bounding boxes, ignoring the measurement of the distance between bounding boxes. Therefore, this paper proposes a Soft-DIoU-NMS algorithm based on DIoU [17]. The formula is shown in Equation (5):
$$S_i = \begin{cases} S_i, & IoU - R_{DIoU}\left(M, b_i\right) < N_t \\ S_i\left(1 - IoU\left(M, b_i\right)\right), & IoU - R_{DIoU}\left(M, b_i\right) \ge N_t \end{cases}, \qquad R_{DIoU} = \frac{\rho^2\left(b, b^{gt}\right)}{c^2} \tag{5}$$
where $S_i$ represents the score of the $i$-th bounding box; $M$ represents the bounding box with the highest current score; $b_i$ represents the $i$-th bounding box; $N_t$ represents the set threshold; $R_{DIoU}$ is the penalty term of the DIoU function. The schematic diagram is shown in Figure 11. $\rho^2\left(b, b^{gt}\right)$ represents the squared Euclidean distance between the center points of the predicted box and the ground-truth box, and $c$ represents the diagonal length of the smallest enclosing rectangle that covers both boxes.
The Soft-DIoU-NMS algorithm employs a score penalization mechanism. When the DIoU value between the highest-scoring bounding box and another bounding box falls below a set threshold, the score of the bounding box remains unchanged. Otherwise, the score of the bounding box is linearly reduced, and it participates in the next iteration until all retained detection boxes are identified. The Soft-DIoU-NMS algorithm not only considers the overlapping area of the bounding boxes but also uses DIoU to minimize the distance between the target bounding boxes. This leads to faster model convergence, higher recall rates for occluded targets, and more reasonable detection results.
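As an illustration of the suppression rule in Equation (5), a simplified single-class NumPy sketch is given below; the threshold and score floor are placeholder values, and boxes are assumed to be in (x1, y1, x2, y2) format:

```python
import numpy as np

def diou(box, boxes):
    """Return IoU and IoU minus the center-distance penalty R_DIoU against one reference box."""
    x1 = np.maximum(box[0], boxes[:, 0]); y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2]); y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area_a = (box[2] - box[0]) * (box[3] - box[1])
    area_b = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    iou = inter / (area_a + area_b - inter + 1e-9)
    # Squared center distance over the squared diagonal of the smallest enclosing box.
    cx_a, cy_a = (box[0] + box[2]) / 2, (box[1] + box[3]) / 2
    cx_b, cy_b = (boxes[:, 0] + boxes[:, 2]) / 2, (boxes[:, 1] + boxes[:, 3]) / 2
    rho2 = (cx_a - cx_b) ** 2 + (cy_a - cy_b) ** 2
    ex1 = np.minimum(box[0], boxes[:, 0]); ey1 = np.minimum(box[1], boxes[:, 1])
    ex2 = np.maximum(box[2], boxes[:, 2]); ey2 = np.maximum(box[3], boxes[:, 3])
    c2 = (ex2 - ex1) ** 2 + (ey2 - ey1) ** 2 + 1e-9
    return iou, iou - rho2 / c2

def soft_diou_nms(boxes, scores, nt=0.5, score_thr=0.001):
    """Keep the highest-scoring box each round and decay the scores of boxes with high DIoU overlap."""
    scores = scores.astype(float).copy()
    keep = []
    idx = np.arange(len(boxes))
    while idx.size:
        best = idx[np.argmax(scores[idx])]
        keep.append(best)
        idx = idx[idx != best]
        if idx.size == 0:
            break
        iou, d = diou(boxes[best], boxes[idx])
        decay = d >= nt
        scores[idx[decay]] *= (1.0 - iou[decay])   # linear score penalty, as in Equation (5)
        idx = idx[scores[idx] > score_thr]
    return keep
```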

3. Experiments and Analysis

3.1. Experimental Environment and Parameter Settings

The experiment used Ubuntu 18.04 as the operating system with an NVIDIA Tesla V100S-PCIE-32GB GPU. It was conducted in a PyTorch 1.8.1 deep learning framework environment with Python 3.8.18, utilizing CUDA 10.1 and CuDNN 7.6.5 for accelerated training. To ensure the fairness of the experiments, all experiments were based on the YOLOv5s model, using pre-trained weights for transfer learning. The cosine annealing strategy was used to adjust the learning rate during training. The experimental parameters were set as shown in Table 1.
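The optimizer and cosine-annealing schedule corresponding to Table 1 can be set up roughly as follows in PyTorch; the model variable is a placeholder, and YOLOv5's actual scheduler additionally applies warm-up and decays toward a final learning-rate fraction:

```python
import torch

# Placeholder module; in practice this is the CER-YOLOv5s network.
model = torch.nn.Conv2d(3, 16, 3)

optimizer = torch.optim.SGD(model.parameters(), lr=0.01,        # initial learning rate
                            momentum=0.937, weight_decay=0.0005)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=300)  # 300 epochs

for epoch in range(300):
    # ... one training epoch over the 640 x 640 images with batch size 16 ...
    scheduler.step()
```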

3.2. Evaluation Metrics

This study selected average precision (AP) and mean average precision (mAP) as the main performance evaluation metrics for the model. Average precision considers the relationship between precision and recall, represented as the area under the P-R curve. mAP is the average of the AP values for all categories. Additionally, FPS was introduced to represent the model detection speed, GFLOPs and Params measured the model’s time and space complexity, and the model size was used to reflect memory usage. The calculation formulas were as follows:
$$Precision = \frac{TP}{TP + FP} \tag{6}$$
$$Recall = \frac{TP}{TP + FN} \tag{7}$$
$$AP = \int_0^1 P(R)\,dR \tag{8}$$
$$mAP = \frac{1}{N} \sum_{i=1}^{N} AP_i \tag{9}$$
where $TP$ represents true positives; $FP$ represents false positives; $FN$ represents false negatives (the number of missed detections); $N$ denotes the total number of categories.
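For illustration, AP as the area under the P-R curve can be computed numerically from ranked detections. The sketch below uses all-point interpolation and assumes the per-detection true-positive flags are already sorted by confidence:

```python
import numpy as np

def average_precision(tp_flags, n_gt):
    """AP as the area under the P-R curve, from per-detection TP/FP flags sorted by confidence."""
    tp_flags = np.asarray(tp_flags, dtype=float)
    tp = np.cumsum(tp_flags)
    fp = np.cumsum(1 - tp_flags)
    recall = tp / max(n_gt, 1)
    precision = tp / np.maximum(tp + fp, 1e-9)
    # Integrate precision over recall (all-point interpolation).
    mrec = np.concatenate(([0.0], recall, [1.0]))
    mpre = np.concatenate(([0.0], precision, [0.0]))
    mpre = np.maximum.accumulate(mpre[::-1])[::-1]   # make precision monotonically decreasing
    idx = np.where(mrec[1:] != mrec[:-1])[0]
    return float(np.sum((mrec[idx + 1] - mrec[idx]) * mpre[idx + 1]))

# mAP is the mean of the per-class AP values.
```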

3.3. Qualitative Experiments

The loss and mAP curves during the training process before and after the model improvement are shown in Figure 12. As seen in Figure 12a, within the first 110 training epochs, the mAP value of the YOLOv5s model was higher than that of the proposed model. Around epoch 130, the mAP values of both models were equal. Subsequently, the mAP value of the proposed model surpassed that of YOLOv5s and eventually stabilized at around 0.98. As seen in Figure 12b, the initial loss values of both models were around 0.16. In the first 50 training epochs, the loss value of the proposed model decreased rapidly, indicating higher learning efficiency in the early stages. After approximately 100 training epochs, the loss values of both models tended to stabilize, and the models gradually converged. Throughout the entire iteration process, the loss value of the proposed model remained consistently lower than that of the original model. At the end of 300 training epochs, the loss value of the proposed model was around 0.035, while the loss value of YOLOv5s was around 0.045. The results indicated that, under the same conditions, the proposed model demonstrated a more significant performance advantage in the target detection task.
Table 2 shows the detection results for each category of engineering vehicles before and after the model improvement. It was observed that the mAP value of the improved CER-YOLOv5s model reached 98.3%, an increase of 2.9% compared with the original YOLOv5s. The average precision for all four categories of engineering vehicles also showed significant improvement, with the truck category seeing the most substantial increase, improving by 4.8 percentage points compared with before the improvement. The experiments indicated that the improved model enhanced the overall detection capability for various types of targets.

3.4. Ablation Study

3.4.1. Ablation Study on the Replacement Positions of the CGC3 Module

To validate the impact of different replacement positions of the CGC3 module on model performance, an ablation study was conducted using YOLOv5s as the reference baseline. The results are shown in Table 3. YOLOv5s + backbone and YOLOv5s + neck represent the replacement of the C3 module in the backbone network and the neck network, respectively. Compared with the baseline model, the mAP increased by 1.1% and 0.8%, the number of parameters decreased by 1.14 M and 0.95 M, and the computational load decreased by 3.4 G and 2.0 G, respectively. In contrast, YOLOv5s + all, which replaced the C3 module throughout the entire network, achieved a final mAP of 97.1%. The parameters and computational load were only 70.1% and 66.9% of the baseline model, respectively, making it more conducive to achieving model lightweighting.

3.4.2. Ablation Experiments on the Dilation Rate Parameters of the ERM Module

To evaluate the impact of the dilation rate parameters of the ERM module on model performance and computational complexity, three different parameter combinations for the ERM module were set up for ablation comparison. The experimental results are shown in Table 4. It was observed that the ERM modules with different dilation rate parameter combinations effectively improved the detection accuracy and had a minimal impact on the model’s computational load. Among them, the ERM module with a dilation rate combination of (1,3,3,5) achieved the best detection performance with a detection accuracy of 96.2%, while maintaining a low computational load and number of parameters. Therefore, this dilation rate parameter setting was used as the benchmark in this study.

3.4.3. Ablation Study of Each Improved Module

To evaluate the impact of various improvement modules on detection performance, a set of ablation experiments was designed using mAP, Params, GFLOPs, and model size as evaluation metrics. The experiments were conducted with the same dataset, experimental environment, and training strategy. The results are shown in Table 5, where “√” indicates that the improvement module was used and “—” indicates that the improvement module was not used.
Experiment 2 involved adding only the CGC3 module, which increased the mAP by 1.7% compared with the original model, reaching a final mAP of 97.1%. The parameter count was reduced to 70.1% of the original model, demonstrating that the Ghost bottleneck structure in the CGC3 module effectively reduced the model complexity. Additionally, the integrated CBAM attention mechanism adaptively extracted key features, enhancing detection performance. Experiment 3 involved replacing the original feature pyramid network with the E-BiFPN structure. This approach, by fully integrating shallow and deep information, improved the efficiency of multi-scale feature fusion. Although this led to a slight increase in complexity compared with the original model, the mAP improved by 1.4%.
Experiment 4 involved adding the ERM (enhanced receptive module) to the neck of the network. By using dilated convolutions with different dilation rates to capture multi-scale contextual relationships of targets, this module effectively improved the detection accuracy for small targets. Despite a slight increase in model parameters, the mAP increased by 0.8%. Experiment 5 aimed to address the issue of the traditional Soft-NMS algorithm’s insensitivity to distance measurement by proposing the Soft-DIoU-NMS algorithm. This new approach reduced incorrect suppression of occluded targets, resulting in a 1.3% improvement in mAP. From the perspective of combining improvement modules, Experiment 6, which added E-BiFPN to the setup from Experiment 2, further enhanced the model’s feature extraction capability. This resulted in an additional 0.3% increase in mAP. Experiment 7 built on Experiment 6 by adding the ERM module, which expanded the receptive field of shallow features. This addition led to a 0.5% increase in mAP. Experiment 8 involved simultaneously incorporating all four improvement modules into the model. The final mAP reached 98.3%, with the model’s parameter count and computational complexity reduced to 5.08 M and 10.9 G, respectively. This further demonstrated the effectiveness of the proposed improvement strategies in the target detection task.

3.5. Comparison Experiments

3.5.1. Comparison Experiment with Feature Fusion Structures

To validate the effectiveness of the E-BiFPN structure, this study compared it with PANet, BiFPN, and AFPN [18] feature fusion structures. To ensure the fairness of the experiment, the backbone network was fixed, and only the performance of different feature fusion structures was compared. As shown in Table 6, the E-BiFPN structure proposed in this paper achieved a better balance between model complexity and detection accuracy, making it more suitable for engineering-vehicle detection tasks.

3.5.2. Receptive Field Module Comparison Experiment

To evaluate the performance of the ERM module in terms of detection accuracy and model complexity, three different receptive field modules were added to YOLOv5s, and comparisons were made using the same experimental environment and training parameters. The experimental results are shown in Table 7. Compared with the ASPP [19] and RFB-s [20] modules, the ERM module improved the model’s mAP from 95.4% to 96.2%, with minimal increase in model parameters and computational load. Figure 13 visualizes the data from Table 7, clearly demonstrating that the ERM module performed best in both detection accuracy and model complexity.

3.5.3. Comparison Experiment with Detection Algorithms

To effectively assess the detection performance of the proposed algorithm, model weight size was added as a lightweight evaluation metric. The improved algorithm was compared with one-stage algorithms SSD, YOLOv5s, YOLOv7 [21], and YOLOv8s, as well as the two-stage algorithm Faster R-CNN and popular lightweight algorithms. The results are shown in Table 8.
From Table 8, it is evident that early SSD and Faster R-CNN, as well as the recently proposed YOLOv7, have large parameter and computational requirements, which hinder their deployment on hardware platforms. Compared with YOLOv8s, the proposed algorithm improved mAP by 2.1% and reduced the model size by 8.0 MB. Compared with lightweight algorithms like YOLOv3-tiny and YOLOv7-tiny, our algorithm achieved significant improvements in mAP by 19.2% and 9.0%, respectively. Additionally, it decreased GFLOPs by 16.2% and 17.4%, respectively, reduced Params by 41.5% and 15.8%, respectively, and decreased the model size by 6.5 MB and 1.6 MB, respectively. Compared with YOLOX-tiny, the algorithm proposed in this paper has slightly higher complexity but demonstrated greater advantages in detection accuracy and model size metrics. At the same time, compared with YOLOv5s, the proposed algorithm reduced the number of parameters by 27.8% and GFLOPs by 31.9%, while achieving an mAP of 98.3% and a model size of only 10.1 MB.
To more thoroughly demonstrate the effectiveness of the proposed algorithm, we compared it with improved algorithms from the related literature while keeping the dataset and experimental parameters unchanged. Compared with references [22] and [23], although the proposed algorithm had slightly more parameters and higher computational complexity, it achieved improvements in accuracy of 4.6% and 5.5%, respectively. Reference [24] had lower model memory usage, but its mAP was 91.7%, which was 6.6 percentage points lower than the mAP of the proposed algorithm. In summary, the CER-YOLOv5s algorithm balanced better detection accuracy and model lightweighting, meeting the needs for engineering-vehicle detection in complex environments and making it more suitable for deployment on devices with limited computational resources.
The algorithms mentioned above were visualized and analyzed based on the same test set, as shown in Figure 14. Due to cluttered background interference, YOLOX-tiny detected the excavator twice in the third image, while YOLOv7-tiny mistakenly detected the red container in the second image as a truck. The SSD and Faster R-CNN algorithms, although free from false detections, exhibited poor performance in detecting small targets in the third image. While YOLOv7 and YOLOv5s showed improved detection accuracy, their confidence scores for detecting small objects were below 0.9. In contrast, our algorithm demonstrated better detection performance overall without encountering false positives.

3.6. Model Robustness Analysis

To evaluate the robustness of the proposed model, test images containing different scenes were selected for validation analysis, with the detection results shown in Figure 15. From left to right, the images include the original image, the detection results for the YOLOv5s model, and the detection results for the CER-YOLOv5s model. Observing Figure 15a,b, it can be seen that the YOLOv5s model failed to accurately detect the excavator target in environments with either ample daylight or insufficient night-time illumination, leading to false negatives and false positives. In contrast, the improved model demonstrated stronger robustness to changes in lighting conditions, accurately predicting target locations and category information, thus providing reliable technical support for night-time monitoring. Observing Figure 15c, it is evident that CER-YOLOv5s performed better than the YOLOv5s model in detecting small targets at a distance, effectively increasing the confidence in small target detection. From Figure 15d, it can be seen that CER-YOLOv5s was able to effectively recognize different targets in dense multi-scale scenes with higher detection confidence compared with the previous model, providing a beneficial solution for detection tasks in complex scenes. Figure 15e shows that, in complex occlusion scenarios, CER-YOLOv5s demonstrated better target association and background reasoning abilities, accurately detecting partially occluded excavator targets and reducing the false positive rate. Overall, CER-YOLOv5s showed better generalization and stronger robustness across different scenarios.

4. Conclusions

This paper proposes a lightweight detection algorithm for engineering vehicles, CER-YOLOv5s, aimed at improving multi-scale target detection accuracy in transmission-line scenarios while reducing model computational resources. Specifically, the C3 module introduces the Ghost bottleneck structure and a hybrid attention mechanism to reduce model complexity while enhancing the saliency of key targets. An E-BiFPN structure is proposed to bidirectionally fuse deep and shallow information, improving the utilization of shallow features and enhancing adaptability to multi-scale target detection. The ERM module has been added to the neck feature fusion part to enhance the model’s fine-grained information extraction capability. The Soft-DIoU-NMS algorithm is used in the post-processing stage to obtain the best prediction boxes, improving the detection accuracy for overlapping targets. Experimental results showed that the proposed algorithm achieved high detection accuracy without increasing storage overhead, reaching an mAP of 98.3% on the validation set, with a model size of only 10.1 MB. Additionally, visualized detection results indicated that the improved model demonstrated strong environmental adaptability in various scenarios, providing a theoretical reference for detection of engineering vehicles. Future work will focus on further optimizing the detection model based on actual transmission-line scenarios and implementing the model on hardware platforms for mobile detection.

Author Contributions

Conceptualization, P.Y. and Y.Y.; methodology, Y.Y.; software, X.T.; validation, P.Y., Y.Y. and Y.S.; formal analysis, H.S.; investigation, Y.S.; resources, H.S.; data curation, Y.Y.; writing—original draft preparation, Y.Y.; writing—review and editing, Y.Y.; visualization, P.Y.; supervision, X.T.; project administration, P.Y.; funding acquisition, P.Y. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Practice Project on Higher Education Teaching Reform, Hebei Provincial Department of Education (2021GJJG198).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The datasets used in this study are not publicly available due to significant time investment, and researchers interested in obtaining these data to verify the validity of their findings should contact the corresponding author of this paper.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Wong, S.Y.; Choe, C.W.C.; Goh, H.H.; Low, Y.W.; Cheah, D.Y.; Pang, C. Power Transmission Line Fault Detection and Diagnosis Based on Artificial Intelligence Approach and its Development in UAV: A Review. Arab. J. Sci. Eng. 2021, 46, 9305–9331. [Google Scholar] [CrossRef]
  2. Hao, S.; Ma, R.Z.; Zhao, X.S.; An, B.; Zhang, X.; Ma, X. Fault detection method for transmission lines based on YOLOv3 with a convolutional block attention model. Power Grid Technol. 2021, 45, 2979–2987. [Google Scholar]
  3. Li, X.F.; Liu, H.Y.; Liu, G.H.; Su, H. Transmission line pin defect detection based on deep learning. Power Grid Technol. 2021, 45, 2988–2995. [Google Scholar]
  4. Liu, X.M.; Tian, H.; Yang, Y.M.; Wang, Y.; Zhao, X. Research on image detection method of insulator defects under complex environmental background. J. Electron. Meas. Instrum. 2022, 36, 57–67. [Google Scholar]
  5. Li, Z.F.; Yang, F.B.; Hao, Y.Q. An aerial photography small target detection algorithm based on residual network optimization. Foreign Electron. Meas. Technol. 2022, 41, 27–33. [Google Scholar]
  6. Yan, J.H.; Zhang, K.; Shi, T.J. Detection of weak and small ground targets in remote sensing images integrating multi-level features. J. Instrum. 2022, 43, 221–229. [Google Scholar]
  7. Hao, S.; Zhao, X.S.; Ma, X.; Zhang, X.; He, T.; Hou, L.-X. Multi-type defect target detection method for transmission lines based on TR-YOLOv5. J. Graph. 2023, 44, 667–676. [Google Scholar]
  8. Xue, Z.; Xu, R.; Bai, D.; Lin, H. YOLO-Tea: A tea disease detection model improved by YOLOv5. Forests 2023, 14, 415. [Google Scholar] [CrossRef]
  9. Han, K.; Wang, Y.; Tian, Q.; Guo, J.; Xu, C.; Xu, C. GhostNet: More features from cheap operations. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020. [Google Scholar]
  10. Woo, S.; Park, J.; Lee, J.Y.; Kweon, I.S. CBAM: Convolutional block attention module. In Proceedings of the 15th European Conference on Computer Vision, Munich, Germany, 8–14 September 2018; pp. 3–19. [Google Scholar]
  11. Lin, T.Y.; Dollár, P.; Girshick, R.; He, K.; Hariharan, B.; Belongie, S. Feature pyramid networks for object detection. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 2117–2125. [Google Scholar]
  12. Liu, S.; Qi, L.; Qin, H.; Shi, J.; Jia, J. Path aggregation network for instance segmentation. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 8759–8768. [Google Scholar]
  13. Tan, M.; Pang, R.; Le, Q.V. EfficientDet: Scalable and efficient object detection. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020. [Google Scholar]
  14. Wang, Q.; Wu, B.; Zhu, P.; Li, P.; Zuo, W.; Hu, Q. ECA-Net: Efficient channel attention for deep convolutional neural networks. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 11531–11539. [Google Scholar]
  15. Li, Y.; Chen, Y.; Wang, N.; Zhang, Z. Scale-Aware trident networks for object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27 October–2 November 2019. [Google Scholar]
  16. Bodla, N.; Singh, B.; Chellappa, R.; Davis, L.S. Soft-NMS—Improving object detection with one line of code. In Proceedings of the 2017 IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017. [Google Scholar]
  17. Zheng, Z.; Wang, P.; Liu, W.; Li, J.; Ye, R.; Ren, D. Distance-IoU loss: Faster and better learning for bounding box regression. Proc. AAAI Conf. Artif. Intell. 2020, 34, 12993–13000. [Google Scholar] [CrossRef]
  18. Yang, G.; Lei, J.; Zhu, Z.; Cheng, S.; Feng, Z.; Liang, R. AFPN: Asymptotic feature pyramid network for object detection. In Proceedings of the IEEE International Conference on Systems, Man, and Cybernetics (SMC), Honolulu, Oahu, HI, USA, 1–4 October 2023; pp. 2184–2189. [Google Scholar]
  19. Chen, L.C.; Papandreou, G.; Kokkinos, I.; Murphy, K.; Yuille, A.L. DeepLab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs. IEEE Trans. Pattern Anal. Mach. Intell. 2018, 40, 834–848. [Google Scholar] [CrossRef] [PubMed]
  20. Liu, S.; Huang, D.; Wang, Y. Receptive field block net for object accurate and fast detection. In Proceedings of the European Conference Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 385–400. [Google Scholar]
  21. Wang, C.Y.; Bochkovskiy, A.; Liao, H.Y. YOLOv7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 7464–7475. [Google Scholar]
  22. Jing, X.P.; Tian, Y. Lightweight vehicle detection using long distance dependence and multi-scale representation. Opt. Precis. Eng. 2023, 31, 950–961. [Google Scholar] [CrossRef]
  23. Lou, Z.; Li, P.; Song, F.; Sun, Q.; Ding, H. Lightweight passion fruit detection model for embedded devices. Trans. Chin. Soc. Agric. Mach. 2022, 53, 262–269. [Google Scholar]
  24. Wang, C.; Zhang, B.; Cao, Y.; Sun, M.; He, K.; Cao, Z.; Wang, M. Mask detection method based on YOLO-GBC network. Electronics 2023, 12, 408. [Google Scholar] [CrossRef]
Figure 1. Data augmentation methods: (a) original image; (b) brightness changed; (c) cropped image; (d) Gaussian noise added; (e) rotated image; (f) translated image.
Figure 2. Overview of the dataset: (a) label categories; (b) bounding box size distribution.
Figure 3. CER-YOLOv5s network model.
Figure 4. Ghost module.
Figure 5. Structure of CBAM–Ghost Bottleneck.
Figure 6. Heatmap visualization results: (a) original image; (b) heatmap of the original model; (c) heatmap with the CBAM module.
Figure 7. CGC3 module structure.
Figure 8. The E-BiFPN network structure.
Figure 9. ECA module.
Figure 10. ERM module.
Figure 11. DIoU penalty-term diagram.
Figure 12. The performance parameter curves: (a) mAP curves; (b) loss curves.
Figure 13. Performance comparison of different receptive field modules.
Figure 14. Comparison of the detection effect of different algorithms.
Figure 15. Detection performance in different scenarios: (a) well-lit scene; (b) low-light scene; (c) scene with a small target; (d) scene with multi-scale targets; (e) occluded scene.
Table 1. Experimental parameter settings.
Parameter | Value
epochs | 300
batch size | 16
image size | 640 × 640
initial learning rate | 0.01
optimizer | SGD
momentum | 0.937
weight decay | 0.0005
Table 2. Comparison of model detection performance.
Models | mAP/% | Truck AP/% | Excavator AP/% | Crane AP/% | Loader AP/%
YOLOv5s | 95.4 | 92.3 | 95.5 | 97.6 | 96.1
CER-YOLOv5s | 98.3 | 97.1 | 98.2 | 99.1 | 98.7
Table 3. Ablation study of replacing positions with the CGC3 module.
Models | mAP/% ↑ | Params/M ↓ | GFLOPs ↓
YOLOv5s | 95.4 | 7.03 | 16.0
YOLOv5s + backbone | 96.5 | 5.89 | 12.6
YOLOv5s + neck | 96.2 | 6.08 | 14.0
YOLOv5s + all (our proposed model) | 97.1 | 4.93 | 10.7
Table 4. Ablation study on the dilation rate parameters.
Models | mAP/% ↑ | Params/M ↓ | GFLOPs ↓
YOLOv5s | 95.4 | 7.03 | 16.0
YOLOv5s + ERM (r1 = 1, r2 = 3, r3 = 3, r4 = 5) | 96.2 | 7.11 | 16.2
YOLOv5s + ERM (r1 = 3, r2 = 3, r3 = 3, r4 = 5) | 95.9 | 7.13 | 15.5
YOLOv5s + ERM (r1 = 3, r2 = 5, r3 = 5, r4 = 7) | 95.5 | 7.22 | 16.2
Table 5. The results of the ablation experiment.
Experiment | CGC3 | E-BiFPN | ERM | Soft-DIoU-NMS | mAP/% ↑ | Model Size/MB ↓ | GFLOPs ↓ | Params/M ↓
1 | — | — | — | — | 95.4 | 13.7 | 16.0 | 7.03
2 | √ | — | — | — | 97.1 | 9.8 | 10.7 | 4.93
3 | — | √ | — | — | 96.8 | 13.9 | 16.2 | 7.10
4 | — | — | √ | — | 96.2 | 13.8 | 16.0 | 7.11
5 | — | — | — | √ | 96.7 | 13.7 | 16.0 | 7.03
6 | √ | √ | — | — | 97.4 | 10.0 | 10.9 | 5.00
7 | √ | √ | √ | — | 97.9 | 10.1 | 10.9 | 5.08
8 | √ | √ | √ | √ | 98.3 | 10.1 | 10.9 | 5.08
Table 6. Comparison of feature fusion structures.
Models | mAP/% ↑ | Params/M ↓ | GFLOPs ↓
YOLOv5s + PANet | 95.4 | 7.03 | 16.0
YOLOv5s + BiFPN | 96.1 | 7.10 | 16.2
YOLOv5s + AFPN | 95.5 | 6.49 | 15.5
YOLOv5s + E-BiFPN | 96.8 | 7.10 | 16.2
Table 7. Experimental comparison of receptive field modules.
Models | mAP/% ↑ | Params/M ↓ | GFLOPs ↓
YOLOv5s | 95.4 | 7.03 | 16.0
YOLOv5s + ASPP | 96.1 | 7.59 | 22.9
YOLOv5s + RFB-s | 95.7 | 7.13 | 17.2
YOLOv5s + ERM | 96.2 | 7.11 | 16.0
Table 8. Experimental comparison of detection algorithms.
Models | mAP/% ↑ | Params/M ↓ | GFLOPs ↓ | FPS ↑ | Model Size/MB ↓
SSD | 84.5 | 24.15 | 61.2 | 21.70 | 92.1
Faster R-CNN | 87.4 | 136.77 | 369.8 | 2.42 | 108.0
YOLOv5s | 95.4 | 7.03 | 16.0 | 60.66 | 13.7
YOLOv7 | 84.4 | 37.21 | 105.2 | 27.47 | 71.3
YOLOv8s | 96.2 | 11.13 | 28.4 | 51.02 | 18.1
YOLOv3-tiny | 79.1 | 8.68 | 13.0 | 94.43 | 16.6
YOLOv7-tiny | 89.3 | 6.03 | 13.2 | 87.37 | 11.7
YOLOX-tiny | 96.6 | 5.06 | 6.45 | 61.49 | 19.42
Reference [22] | 93.7 | 4.11 | 7.3 | 72.10 | 17.9
Reference [23] | 92.8 | 4.53 | 9.6 | 53.42 | 6.4
Reference [24] | 91.7 | 7.17 | 11.3 | 64.61 | 3.9
CER-YOLOv5s | 98.3 | 5.08 | 10.9 | 63.53 | 10.1
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
