Article

A Multiscale Grouped Convolution and Lightweight Adaptive Downsampling-Based Detection of Protective Equipment for Power Workers

1 Faculty of Information Engineering and Automation, Kunming University of Science and Technology, Kunming 650032, China
2 Computer Technology Application Key Lab of the Yunnan Province, Kunming 650500, China
* Author to whom correspondence should be addressed.
Electronics 2024, 13(11), 2079; https://doi.org/10.3390/electronics13112079
Submission received: 1 April 2024 / Revised: 16 May 2024 / Accepted: 21 May 2024 / Published: 27 May 2024
(This article belongs to the Section Artificial Intelligence)

Abstract

Convolutional neural network-based detection models have been extensively applied in industrial production to monitor the use of safety protection equipment and ensure worker safety. For electrical operations, where the safety protection requirements are more comprehensive and stringent, this paper proposes an improved detection model, ML-YOLOv8n-Light, based on YOLOv8n, targeting the low detection efficiency and large model size that make current safety protection equipment-wearing detection models difficult to deploy. Our model confronts the disparity in safety protection equipment sizes with a novel, lightweight multi-scale grouped convolution (MSGC) scheme integrated into the architecture. A lightweight adaptive weight downsampling (LAWD) mechanism is also designed to replace traditional downsampling methods, optimizing resource consumption without sacrificing performance. Additionally, to enhance the detection fidelity for smaller items, such as insulated gloves, we add feature-rich shallow maps and a dedicated detection head for such objects. To enhance the detection efficiency of YOLOv8n, inspired by partial convolution, we improve the spatial pyramid pooling fast (SPPF) module and the detection heads. Experiments on the custom power safe attire dataset (PSAD) show that, compared to the original model, mAP50 increased by 2.1% and mAP50-95 by 3.1%, with a 29% reduction in parameters, an 18% reduction in computation, and a 23% compression of the model size. There are fewer detection omissions at long distances and under occlusion, fewer false positives, and computing resources are allocated more efficiently.

1. Introduction

In the power industry and other high-risk industrial sectors such as oil, gas, mining, and construction, the wearing of protective equipment by workers is a core element of work safety. Wearing the required protective equipment during construction, such as the most basic helmet, allows workers to resist the penetration of objects, absorb the impact of direct blows to the head, and effectively minimize injuries in the event of an accident. In addition, in the electric power industry, the “Electricity Safety Regulations” require not only head protection but whole-body protection, including but not limited to insulated clothing, insulated pants, and insulated gloves. The comprehensiveness of the protective equipment is essential to prevent potential dangers such as electric shock and arc flash. According to annual statistics of electrical accidents in Europe and the United States [1], arc accidents account for 70% of electrical accidents and cause more than 20% of permanent injuries; each year, about 2000 workers suffer arc burns and are hospitalized in burn centers, while thousands more face potentially serious or even fatal consequences from arc accidents.
Most current on-site monitoring of workers' safety protective equipment still relies on manual visual inspection. As the monitoring time and scope increase, manual supervision causes visual fatigue, leading to misjudgments and missed inspections, and it also consumes large amounts of material and human resources, so it cannot meet the current safety management requirements of high-risk industries. Many researchers have therefore studied deep-learning-based detection algorithms for workers' safety protective equipment in electric power operation scenes; these algorithms use surveillance images to automatically detect whether on-site workers are wearing safety protective equipment, and they offer low cost, fast deployment, and high detection efficiency.
Xingke Li et al. [2] addressed the challenge of accurate and fast detection of safety helmet use in power plant operations, and proposed an advance to the YOLOv5 algorithm. The innovation centers around the introduction of a weighted BiFPN [3] network structure replacing the traditional FPN + PAN [4] combination in the algorithm’s neck for improved feature integrity and helmet detection precision. Furthermore, the integration of the convolutional block attention module (CBAM) [5] and the use of its attention mechanism refines feature extraction, while Kmeans++ clustering optimizes the anchor box selection. Comparative studies have revealed that the modified algorithm achieves a 90.44% mAP, outperforming the original YOLOv5 by 6.15%; however, as the model volume increases, the detection speed decreases.
Peiyun Feng et al. [6] proposed an enhanced Cascade R-CNN [7] algorithm with D-ResNet50 and CARA-RFP modules, which adaptively learn convolutional offsets while efficiently using feedback information to re-refine features, improving feature reusability and the model's detection performance. The upgraded Cascade R-CNN improves average precision and recall by 3.5% and 1.9%, respectively, over the original model, and mitigates the problem that large variations in the shape and scale of helmet targets lower detection precision and lead to false and missed detections. However, model computation increased by 1.5% and the number of parameters by 3.6%, with an overall rise in model complexity.
Jing Ma et al. [8] sought to improve the detection accuracy for safety equipment at electrical construction sites. Their paper presented STDAB-DETR, utilizing Swin transformer [9] as a backbone within a detection transformer framework for its strong feature extraction capabilities. The method addresses traditional CNN shortcomings via hierarchical structures and attention mechanisms, efficient feature fusion with multi-scale architecture, and deformable convolution networks to better model complex shapes. Additionally, a new bounding-box loss function blending Smooth L1 [10] and CIOU was introduced to refine the object localization based on overlap, size, and shape considerations. But the model is too large to be deployed on mobile devices and is slow to detect.
Wang Ru et al. [11] proposed Wear-YOLO, an improved algorithm based on YOLOv8, for detecting the safety equipment of substation power operators. To capture contextual information more accurately in complex power scenarios, the MobileViTv3 [12] module was introduced, and WIoUv3 [13] was used as the bounding-box regression loss function, effectively allowing the model to focus on higher-quality anchor boxes and improving detection accuracy and adaptability to complex situations. On its homemade dataset, the algorithm achieved an average detection accuracy of 92.1%, 2.7% higher than YOLOv8n, but the number of parameters increased by 20% and the detection speed was slightly reduced.
While numerous scholars have achieved varied results with different detection models, the YOLO series stands out for its well-documented balance of rapid detection speed and high accuracy. These characteristics align with the real-time requirements of safety equipment detection, and the flexibility in model sizing allows effective adaptation to devices with different storage capabilities. Therefore, to address the low detection efficiency, large model size, and difficult deployment of wearing-detection models for electric-power safety protection equipment, this paper proposes a high-efficiency detection model, ML-YOLOv8n-Light, optimized on the basis of the newest model in the YOLO series, YOLOv8n. The upgraded model not only has higher detection efficiency but is also smaller and easier to deploy. Compared with the original model, the mAP50 of ML-YOLOv8n-Light is improved by 2.1% and mAP50-95 by 3.1%, the number of parameters is reduced by 29%, the computational requirement is reduced by 18%, and the model size is compressed by 23%. In addition, long-range misses, occlusion misses, and false positives are dramatically reduced, and computational resources are allocated more efficiently.

2. YOLOv8

YOLOv8 [14], proposed by Glenn-Jocher, is in the same lineage as YOLOv3 and YOLOv5 [15], and is the latest model in the current YOLO [16,17,18,19,20,21,22] series.
According to the size of the model, YOLOv8 is divided into YOLOv8n, YOLOv8s, YOLOv8m, YOLOv8l, and YOLOv8x. As shown in Figure 1, the network structure of YOLOv8 is divided into three parts: the backbone network, the feature fusion structure, and the detection head.
The backbone network is mainly used for image feature extraction and passes three feature maps of different scales to the feature fusion structure; its components include CBS, the cross-stage partial bottleneck with two convolutions (C2f), and spatial pyramid pooling fast (SPPF) [23]. The cross-stage partial (CSP) module of YOLOv5 is replaced with the lighter C2f module, which enhances feature expression through its residual structure and provides rich gradient-flow information while remaining lightweight; the SPPF module extracts spatial feature information at different scales.
The feature fusion structure (neck) is used to fuse the received feature maps at different scales, and the fusion method is extended from the feature pyramid networks (FPNs) and the path aggregation network (PAN) used in YOLOv5, but the 1 × 1 convolutional block before upsampling is removed.
The detection head (head) adopts the current mainstream decoupled head, which separates classification from regression and switches from an anchor-based to an anchor-free design; the positive–negative sample allocation strategy references the task-aligned assigner of TOOD [24]; the classification loss function follows the focal loss [25] of YOLOv5, and the regression loss function uses CIOU loss [26].

3. ML-YOLOv8n-Light Model Construction

3.1. Designing Lightweight Multiscale Convolutional Block MSGC

We counted the size distribution of the label boxes in the dataset; the results are shown in Figure 2, where the horizontal and vertical coordinates are the ratios of the label-box width and height to the whole image, respectively, so boxes toward the lower left are small targets and boxes toward the upper right are large targets. The figure shows many small targets clustered in the lower left, because helmets, insulated gloves, and cuffs are small and account for a high percentage of the dataset. There are also many larger targets, because workers wearing insulated suits and unrelated personnel are large and also represent a substantial share of the dataset. As a result, the detection targets span a wide range of scales, and traditional convolution operations such as the widely used 3 × 3 convolution cannot effectively capture this scale variability, leading to unsatisfactory detection performance.
To effectively address this issue, this paper has designed a multi-scale group convolution technique called multi-scale grouped convolution (MSGC). The crux of MSGC is to target features of varying scales by employing diverse convolution kernel sizes through channel grouping. Specifically, the input feature map is evenly divided into four groups along the channel dimension, and each group is convolved with a different scale of convolution kernels (1 × 1, 3 × 3, 5 × 5, and 7 × 7) to obtain feature maps corresponding to various object scales. Subsequently, the point convolution operations using 1 × 1 convolution kernels are applied to aggregate these feature maps from the four scales, creating the final comprehensive feature map. Integrating MSGC into this section’s detection model enhances the adaptability to objects of different sizes, thus improving detection performance on the dataset being discussed. MixConv [27] is very similar to our work, which also mixes convolution kernels of different sizes into a single layer of convolution to obtain feature maps at different scales; however, our MSGC takes a key step forward in the effective aggregation of feature maps, which is not available in MixConv. MSGC takes full advantage of the complementarities among multi-scale features through the subsequent integration operation, which can effectively preserve and enhance the feature information at different scales.
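To make the grouping-and-aggregation scheme concrete, the following is a minimal PyTorch sketch of an MSGC block as described above; the module name, the "same" padding, and the absence of normalization and activation layers are our own illustrative choices rather than the authors' exact implementation.

```python
import torch
import torch.nn as nn

class MSGC(nn.Module):
    """Sketch of multi-scale grouped convolution (MSGC), assuming channel counts divisible by 4:
    the input is split into four channel groups, each group is convolved with a 1x1, 3x3, 5x5,
    or 7x7 kernel, and a final 1x1 pointwise convolution aggregates the four scales."""
    def __init__(self, in_channels, out_channels):
        super().__init__()
        assert in_channels % 4 == 0 and out_channels % 4 == 0
        cin, cout = in_channels // 4, out_channels // 4
        kernels = (1, 3, 5, 7)
        # one branch per channel group; "same" padding keeps the spatial size unchanged
        self.branches = nn.ModuleList(
            [nn.Conv2d(cin, cout, k, padding=k // 2) for k in kernels]
        )
        # pointwise convolution that fuses the four multi-scale feature maps
        self.fuse = nn.Conv2d(out_channels, out_channels, 1)

    def forward(self, x):
        groups = torch.chunk(x, 4, dim=1)          # split along the channel dimension
        feats = [b(g) for b, g in zip(self.branches, groups)]
        return self.fuse(torch.cat(feats, dim=1))  # aggregate across scales
```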
To assess the computational efficiency of MSGC, we next quantitatively compare its parameter count and computational cost with those of a conventional 3 × 3 convolution. The parameter count of a convolution is the product of the kernel height, the kernel width, and the numbers of input and output channels of the feature map; its computational cost is usually measured by the number of multiply–add operations, which, once the parameter count is known, can be approximated as the product of the feature-map height, the feature-map width, and the parameter count.
Assuming the feature map has height $H$, width $W$, input channel count $C_{in}$, and output channel count $C_{out}$, the parameter counts of the different convolution kernel sizes within MSGC are as follows:

$params_{1\times1} = 1 \times 1 \times \frac{C_{in}}{4} \times \frac{C_{out}}{4}$

$params_{3\times3} = 3 \times 3 \times \frac{C_{in}}{4} \times \frac{C_{out}}{4}$

$params_{5\times5} = 5 \times 5 \times \frac{C_{in}}{4} \times \frac{C_{out}}{4}$

$params_{7\times7} = 7 \times 7 \times \frac{C_{in}}{4} \times \frac{C_{out}}{4}$

In addition, the parameter count of the final 1 × 1 pointwise convolution of MSGC is as follows:

$params_{pwconv} = 1 \times 1 \times C_{in} \times C_{out}$

Summing the four branches and the pointwise convolution gives the total parameter count of MSGC:

$Params_{MSGC} = \frac{1}{16} \times (1 + 9 + 25 + 49) \times C_{in} \times C_{out} + C_{in} \times C_{out}$

$Params_{MSGC} = \left(\frac{84}{16} + 1\right) \times C_{in} \times C_{out}$

$Params_{MSGC} = 6.25 \times C_{in} \times C_{out}$

Finally, the parameter and computation ratios of MSGC to the traditional 3 × 3 convolution are the following:

$\text{Params Ratio} = \frac{Params_{MSGC}}{Params_{3\times3}} = \frac{6.25 \times C_{in} \times C_{out}}{3 \times 3 \times C_{in} \times C_{out}} = \frac{6.25}{9} \approx 0.69$

$\text{FLOPs Ratio} = \frac{H \times W \times Params_{MSGC}}{H \times W \times Params_{3\times3}} = \frac{Params_{MSGC}}{Params_{3\times3}} \approx 0.69$
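As a quick sanity check of the ratio above, the following short snippet counts the parameters of the MSGC sketch given earlier against a plain 3 × 3 convolution for a hypothetical channel width of 128; the small deviation from 6.25/9 comes from bias terms, which the derivation ignores.

```python
import torch.nn as nn

def n_params(m):
    return sum(p.numel() for p in m.parameters())

cin = cout = 128                             # hypothetical channel width
msgc = MSGC(cin, cout)                       # the MSGC sketch defined above
conv3x3 = nn.Conv2d(cin, cout, 3, padding=1)
print(n_params(msgc) / n_params(conv3x3))    # ~0.70, close to 6.25/9 ≈ 0.69
```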
By comparing the parameter and computation counts, we find that MSGC reduces both by about 31% compared with the traditional 3 × 3 convolution. However, frequently switching between convolution kernels of different sizes increases the model's practical computational complexity at run time. To balance efficiency and performance, this paper does not use MSGC as the main feature-extraction module; instead, it is integrated as an auxiliary multi-scale feature-extraction component into the bottleneck structure of the C2f module, constructing the MSGC-bottleneck structure shown in Figure 3.

3.2. Design Lightweight Adaptive Weight Downsampling (LAWD)

Traditional downsampling methods, such as max pooling and average pooling, reduce the spatial resolution of the feature map in a fixed, predetermined way to cut computation and avoid overfitting. However, pooling cannot adjust weights according to the content or importance of individual pixels; feature selection is blind, and important spatial feature information can be lost during downsampling. Strided convolution, with a stride greater than 1, also reduces the spatial resolution of the feature map but, unlike pooling, can learn valuable features while downsampling and avoid losing large amounts of key information. However, relying heavily on convolutional layers for downsampling sharply increases the parameters and computation of the model, which is a significant disadvantage when deploying on resource-constrained mobile devices.
Considering the limitations of traditional downsampling methods, this paper proposes a novel downsampling structure, lightweight adaptive weight downsampling (LAWD). LAWD treats the differences between pixels more finely, allowing the network to adaptively adjust the weight assigned to each pixel according to its content, thereby reducing the spatial dimension while retaining important feature information, with fewer parameters and less computation than strided convolution. The structure consists of two parallel branches, as shown in Figure 4.
The first branch generates an adaptive weight map. The input feature map is first average-pooled to obtain a feature map with reduced spatial resolution; a 1 × 1 convolution is then applied to generate the preliminary weights, and a softmax activation outputs the weight distribution of each pixel. Let the input feature map have size (b, c, h, w), where b, c, h, and w denote the batch size, number of channels, height, and width, respectively. The first branch converts the input from (b, c, h, w) to (b, c, h/2, w/2, 4); the channel sub-dimension of the weight map is set to four so that its shape matches the output of the second branch for the subsequent element-wise multiplication.
The second branch uses group convolution to fold spatial information into the channels, simulating the advantages of the focus structure while avoiding its complex slicing operations. In this branch, the group convolution kernel is 3 × 3, the feature maps are grouped evenly into 8 channels per group, spatial downsampling is performed with a stride of 2, and the number of channels is expanded to four times the input. The second branch thus also produces a feature map with dimensions (b, c, h/2, w/2, 4).
After both branches complete their operations, the adaptive weights are multiplied element-wise with the corresponding elements of the group-convolution output, and the result is summed along the channel sub-dimension, merging the features with the adaptive weight information to achieve adaptive downsampling. The final output feature map has size (b, c, h/2, w/2).
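To make the two-branch design concrete, the following is a minimal PyTorch sketch of LAWD as described above; it assumes the module keeps the channel count unchanged and that the 1 × 1 convolution in the weight branch outputs four weight channels per input channel. These are our reading of the description, not the authors' released code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LAWD(nn.Module):
    """Sketch of lightweight adaptive weight downsampling (LAWD).
    Branch 1: AvgPool(2) -> 1x1 conv -> softmax, giving four adaptive weights per output pixel.
    Branch 2: 3x3 grouped conv (8 channels per group, stride 2) expanding channels 4x,
    which folds 2x2 spatial neighbourhoods into the channel dimension (focus-like).
    The weighted sum over the sub-dimension of 4 yields the downsampled (b, c, h/2, w/2) map.
    The 4c-channel 1x1 convolution in branch 1 is an assumption made for this sketch."""
    def __init__(self, channels):
        super().__init__()
        assert channels % 8 == 0
        self.pool = nn.AvgPool2d(kernel_size=2, stride=2)
        self.weight_conv = nn.Conv2d(channels, channels * 4, kernel_size=1)
        self.group_conv = nn.Conv2d(channels, channels * 4, kernel_size=3, stride=2,
                                    padding=1, groups=channels // 8)

    def forward(self, x):
        b, c, h, w = x.shape
        # Branch 1: adaptive weights with layout (b, c, 4, h/2, w/2), softmax over the 4 sub-channels
        w_map = self.weight_conv(self.pool(x)).view(b, c, 4, h // 2, w // 2)
        w_map = F.softmax(w_map, dim=2)
        # Branch 2: grouped strided convolution, reshaped to the same (b, c, 4, h/2, w/2) layout
        feats = self.group_conv(x).view(b, c, 4, h // 2, w // 2)
        # Element-wise weighting followed by summation over the sub-dimension of 4
        return (w_map * feats).sum(dim=2)
```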
Next, we compare LAWD with some similar work. The authors of [28] introduced an adaptive downsampling scheme in which information-rich areas keep a higher resolution than areas with less information; this method emphasizes changing the resolution according to information content, whereas LAWD adjusts weights based on pixel importance rather than varying the resolution, using a more consistent downsampling scheme that makes the learning process more stable. SliceSamp [29] applies slicing and depthwise separable convolution to enlarge the feature-map channels to four times the original size and then uses a 1 × 1 convolution to aggregate the channel information, while LAWD performs the channel expansion with a single group convolution, which is more concise. The authors of [30] used spatially adaptive computation to execute different numbers of layers for different areas, emphasizing deeper computation for important or complex regions of the image, whereas LAWD focuses on the relative importance of feature-map pixels. The work in [31] simulates real-world downsampling modes to generate better super-resolution images, but it must learn from paired LR and HR images, which LAWD does not require.
Finally, we compare LAWD with strided convolution in terms of parameters and computation. Again assume the feature map has height $H$, width $W$, input channel count $C_{in}$, and output channel count $C_{out}$. For a strided convolution with a 3 × 3 kernel and stride 2, the parameter count is as follows:

$Params_{SConv} = 3 \times 3 \times C_{in} \times C_{out}$

Because the stride is 2, the convolution is evaluated at only half the positions along both the height and the width of the feature map, so its computation is:

$FLOPs_{SConv} = \frac{H}{2} \times \frac{W}{2} \times 3 \times 3 \times C_{in} \times C_{out}$

Next, consider the first branch of LAWD. The pooling and softmax operations have no trainable parameters, so the parameter count of the first branch comes only from its 1 × 1 convolution:

$Params_{LAWD_1} = 1 \times 1 \times C_{in} \times C_{out}$

Since pooling and softmax perform only a fixed, small amount of computation on the feature-map channels compared with convolution, their cost is ignored here, and the computation of the first branch is the following:

$FLOPs_{LAWD_1} = H \times W \times 1 \times 1 \times C_{in} \times C_{out}$

For the second branch, only the group convolution contributes parameters. With 8 channels per group, let $group$ denote the number of groups of the group convolution:

$group = \frac{C_{in}}{8}$

Each group has 8 input channels and, because the channel count is expanded to four times the input, 32 output channels. The parameter count of the second branch is therefore the number of groups multiplied by the parameters per group:

$Params_{LAWD_2} = group \times 3 \times 3 \times 8 \times 32$

$Params_{LAWD_2} = \frac{C_{in}}{8} \times 3 \times 3 \times 8 \times 32$

The group convolution uses a stride of 2, so the computation of the second branch is:

$FLOPs_{LAWD_2} = \frac{H}{2} \times \frac{W}{2} \times Params_{LAWD_2}$

$FLOPs_{LAWD_2} = \frac{H}{2} \times \frac{W}{2} \times \frac{C_{in}}{8} \times 3 \times 3 \times 8 \times 32$

Adding the parameter counts of the two branches gives the parameter count of LAWD:

$Params_{LAWD} = Params_{LAWD_1} + Params_{LAWD_2}$

$Params_{LAWD} = 1 \times 1 \times C_{in} \times C_{out} + \frac{C_{in}}{8} \times 3 \times 3 \times 8 \times 32$

$Params_{LAWD} = \left(\frac{288}{C_{out}} + 1\right) \times C_{in} \times C_{out}$

Adding the computation of the two branches and ignoring the element-wise multiplication gives the approximate computation of LAWD:

$FLOPs_{LAWD} = H \times W \times 1 \times 1 \times C_{in} \times C_{out} + \frac{H}{2} \times \frac{W}{2} \times \frac{C_{in}}{8} \times 3 \times 3 \times 8 \times 32$

$FLOPs_{LAWD} = \frac{H}{2} \times \frac{W}{2} \times \left(\frac{288}{C_{out}} + 4\right) \times C_{in} \times C_{out}$

Finally, the parameter and computation ratios of LAWD to ordinary strided convolution are as follows:

$\text{Params Ratio} = \frac{Params_{LAWD}}{Params_{SConv}} = \frac{\left(\frac{288}{C_{out}} + 1\right) \times C_{in} \times C_{out}}{3 \times 3 \times C_{in} \times C_{out}} = \frac{\frac{288}{C_{out}} + 1}{9}$

$\text{FLOPs Ratio} = \frac{FLOPs_{LAWD}}{FLOPs_{SConv}} = \frac{\frac{H}{2} \times \frac{W}{2} \times \left(\frac{288}{C_{out}} + 4\right) \times C_{in} \times C_{out}}{\frac{H}{2} \times \frac{W}{2} \times 3 \times 3 \times C_{in} \times C_{out}} = \frac{\frac{288}{C_{out}} + 4}{9}$
From the parameter and FLOPs ratios above, it can be seen that both ratios depend on the number of output channels $C_{out}$ of the feature map. According to the parameter ratio, when $C_{out}$ is greater than 36, LAWD has fewer parameters than strided convolution, and the larger $C_{out}$ is, the greater the saving; in the detection network used in this paper, the number of feature-map channels is greater than 36 most of the time. According to the FLOPs ratio, when $C_{out}$ is greater than 58, LAWD requires less computation than strided convolution, and again the saving grows with $C_{out}$; in the network structure used in this paper, the channel count is greater than 58 most of the time. Therefore, LAWD achieves adaptive weighting during downsampling with a small number of parameters and a small amount of computation. In addition, the channel grouping of the group convolution (8 channels per group in this paper) can be adjusted for different application scenarios and network structures to better balance computational resources and detection performance.
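The break-even points quoted above follow directly from the two ratios; the short snippet below simply evaluates them for a few hypothetical output-channel counts.

```python
# Break-even check for the LAWD ratios derived above:
#   params ratio (288/C_out + 1)/9 < 1  =>  C_out > 36
#   FLOPs  ratio (288/C_out + 4)/9 < 1  =>  C_out > 57.6, i.e. C_out >= 58
for c_out in (32, 36, 64, 128, 256):        # hypothetical channel counts
    params_ratio = (288 / c_out + 1) / 9
    flops_ratio = (288 / c_out + 4) / 9
    print(c_out, round(params_ratio, 2), round(flops_ratio, 2))
```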

3.3. Lightweight Module Design and Addition of Small Target Detection Layer

The SPPF module obtains feature maps with different scale information through three max-pooling operations connected in series together with multiple dense connections, but the pooling operation is not learnable. To enhance the feature-extraction capability, the spatial pyramid pooling and fully spatial pyramid convolution (SPPFCSPC) module of [22] can be used; its structure is shown in Figure 5. Multiple convolutional blocks are added to the SPPF, making it learnable and better able to extract and fuse features at different scales. However, SPPFCSPC has a huge number of parameters and heavy computation, and the module is generally placed at the end of the backbone network, where the network is deep and the feature maps have many channels, so several channels may carry similar or identical feature information. Therefore, this paper introduces partial convolution (PConv), proposed in the lightweight backbone network FasterNet [32]; its structure is shown in Figure 6. PConv performs ordinary convolution on only some of the input feature-map channels and passes the remaining channels through unchanged. Replacing the two 3 × 3 ordinary convolution blocks of SPPFCSPC with 3 × 3 PConv greatly reduces its parameters and computation. Finally, the feature layer to be processed is channel-compressed before the multi-scale feature extraction of SPPFCSPC and the channels are restored afterwards, which reduces the redundancy of the feature map and further reduces the parameters of SPPFCSPC, yielding the SPPFCSPC-Light structure used in this paper.
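The following is a minimal PyTorch sketch of the partial-convolution idea referred to above; the 1/4 partial ratio follows the FasterNet default and is an assumption here rather than the authors' stated setting.

```python
import torch
import torch.nn as nn

class PConv(nn.Module):
    """Sketch of partial convolution (PConv) in the spirit of FasterNet [32]:
    a 3x3 convolution is applied to only a fraction of the input channels
    (here 1/4, an assumed ratio) while the remaining channels are passed
    through unchanged, cutting parameters and FLOPs on redundant feature maps."""
    def __init__(self, channels, ratio=4, kernel_size=3):
        super().__init__()
        self.conv_ch = channels // ratio            # channels that are actually convolved
        self.conv = nn.Conv2d(self.conv_ch, self.conv_ch, kernel_size,
                              padding=kernel_size // 2)

    def forward(self, x):
        x1, x2 = torch.split(x, [self.conv_ch, x.size(1) - self.conv_ch], dim=1)
        return torch.cat((self.conv(x1), x2), dim=1)  # untouched channels are kept as-is
```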
In addition, YOLOv8 uses the current mainstream decoupled detection head; by separating classification from regression, it can better handle feature information of different scales and levels of detail and obtain better detection results. However, as seen in the detect module in Figure 1, the classification and regression branches each contain two 3 × 3 convolutional blocks, so the total number of parameters and the computation of the three detect modules are extremely large. In this paper, we lighten the detect module of YOLOv8, as shown in Figure 7, which depicts the lightened detect module Detect-Light: a 3 × 3 PConv and a 1 × 1 convolutional block replace the four 3 × 3 convolutional blocks of the original detect module before the module splits into classification and regression branches. This retains the accurate detection provided by the original decoupled head while making the model lighter and more efficient in its use of resources.
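The sketch below illustrates our reading of the Detect-Light head for a single feature level: a shared 3 × 3 PConv plus a 1 × 1 convolutional block feeding two lightweight 1 × 1 prediction branches. The normalization, activation, and output channel sizes (reg_max = 16, as in YOLOv8) are assumptions, not the authors' exact configuration.

```python
import torch.nn as nn

class DetectLight(nn.Module):
    """Sketch of a lightened decoupled head: a shared PConv + 1x1 conv block replaces the
    four 3x3 convolutions of the original YOLOv8 detect module, then 1x1 branches produce
    the classification and box-regression outputs (channel sizes are assumptions)."""
    def __init__(self, channels, num_classes, reg_max=16):
        super().__init__()
        self.shared = nn.Sequential(
            PConv(channels),                          # PConv sketch defined above
            nn.Conv2d(channels, channels, kernel_size=1),
            nn.BatchNorm2d(channels),
            nn.SiLU(),
        )
        self.cls_branch = nn.Conv2d(channels, num_classes, kernel_size=1)
        self.reg_branch = nn.Conv2d(channels, 4 * reg_max, kernel_size=1)

    def forward(self, x):
        x = self.shared(x)
        return self.cls_branch(x), self.reg_branch(x)
```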
Finally, since small targets account for a large proportion of the PSAD, and to improve small-target detection without degrading the detection accuracy of larger targets in the dataset, this paper introduces a shallow, information-rich feature map into the feature fusion structure and adds an additional small-target detection head. The resulting safety protective equipment-wearing detection model for electric power operations, ML-YOLOv8n-Light, is shown in Figure 8.

4. Experimental Design and Analysis of Results

4.1. Power Safe Attire Dataset

In this paper, we construct the power safe attire dataset (PSAD), which specializes in the detection of multiple pieces of protective equipment in power operation scenarios.
The PSAD collects photos of workers in typical power operation environments, such as substations and transmission towers, which reflect the actual wearing status of workers on site. Its construction takes into account the diversity and practicality of the images, which come from a wide range of sources and cover power operation environments from various shooting angles. A total of 6000 high-quality images were collected, and the PSAD was divided into training, validation, and test sets at a ratio of 8:1:1. The key safety protection equipment in the images was accurately labeled; the label distribution is shown in Figure 9, where the horizontal coordinate is the number of labels.
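For reference, an 8:1:1 split of an image list such as that used for the PSAD can be produced with a few lines of Python; the function name and fixed seed below are illustrative, not part of the authors' pipeline.

```python
import random

def split_8_1_1(image_paths, seed=0):
    """Shuffle an image list and split it into train/val/test at the 8:1:1 ratio used above."""
    paths = list(image_paths)
    random.Random(seed).shuffle(paths)      # fixed seed for a reproducible split
    n = len(paths)
    n_train, n_val = int(0.8 * n), int(0.1 * n)
    return paths[:n_train], paths[n_train:n_train + n_val], paths[n_train + n_val:]
```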
The dataset labels are divided into five categories: 'worker', a worker wearing suitable insulated clothing and insulated pants that can effectively deal with potential hazards such as electric shocks and arc flash explosions; 'Badge', the worker's cuff badge, which makes it quicker and easier to identify the worker; 'safety hat', a safety hat worn correctly on the worker's head, the most basic safety requirement in electric power operations; 'glove', a correctly worn insulated glove, which is essential to prevent the risk of electric shock when workers come into contact with high-voltage electrical installations; and 'irrelevant personnel', unrelated personnel who are not wearing the appropriate safety gear, whose identification helps minimize safety hazards and disruptions.

4.2. Experimental Environment and Parameter Settings

The hardware configuration, software environment, and core parameter settings used for the experiments in this paper are shown in Table 1 and Table 2.

4.3. Evaluation Metrics

This experiment uses precision, recall, and mean average precision (mAP) to measure detection precision, calculated as follows:

$Precision = \frac{TP}{TP + FP}$

$Recall = \frac{TP}{TP + FN}$

$mAP = \frac{1}{n} \sum_{i=1}^{n} \int_{0}^{1} P(R)\, dR$
where TP denotes the number of correctly detected targets, FP denotes the number of incorrectly detected targets, and FN denotes the number of true targets that were missed.
In addition, model complexity is measured using the number of parameters (params), computation (GFLOPs), and model size (model-size), and frames per second (FPS) measures the detection speed.
Because the detection accuracy of the model does not change dramatically after introducing MSGC and LAWD, and to verify that the small accuracy changes result from the algorithmic improvements rather than experimental randomness, the results in Section 4.4 and Section 4.5 are the means of five repeated runs.

4.4. Effect of Introducing MSGC on Model Performance

The experimental results are shown in Table 3. Replacing the C2f of YOLOv8n with C2f-MixConv variants that integrate different combinations of convolution kernel sizes yields models such as YOLOv8n+C2f_MixConv(1,3), and adding the final 1 × 1 convolution that aggregates the feature information of the different channel groups to YOLOv8n+C2f_MixConv(1,3,5,7) yields the YOLOv8n+C2f_MSGC model. As can be seen in Table 3, YOLOv8n+C2f_MixConv(1,3,5,7) performs best among the kernel combinations, verifying the validity of the 1, 3, 5, 7 combination, and YOLOv8n+C2f_MSGC improves mAP50 by 0.9% and mAP50-95 by 0.6% over the original model, an obvious improvement that verifies the rationality of using a 1 × 1 convolution to unify the feature information of the different channel groups. The number of parameters, computation, and model size all decrease slightly, confirming that MSGC is lighter than ordinary convolution; however, the detection speed also decreases slightly, reflecting the effect of switching between convolution kernels of different sizes on the model's actual computational complexity.
The accuracy of each specific detection target before and after the improvement is compared in Figure 10, from which it can be seen that the accuracy of the medium-sized safety hat drops by 0.4%. MSGC replaces the traditional single-scale 3 × 3 convolution with multi-scale convolution to enhance feature capture for targets of different scales, but in this design each kernel size acts on only a portion of the channels, significantly reducing the number of channels that the original 3 × 3 convolution devoted to capturing medium-sized features. This weakens the network's ability to capture medium-sized features and causes the decrease in safety hat accuracy. Nonetheless, the accuracy of the other detection targets at different scales improves, verifying the better adaptability of MSGC to targets of different scales.

4.5. Effect of Introducing LAWD on Model Performance

The experimental results are shown in Table 4: using only average pooling for downsampling yields the YOLOv8n+Avgpool model, using only group convolution with stride 2 yields the YOLOv8n+GroupConv model, and replacing the strided-convolution downsampling of YOLOv8n with LAWD yields the YOLOv8n+LAWD model. As can be seen from Table 4, using only pooling or group convolution significantly reduces the number of parameters and the computation, making the model lighter, but detection accuracy also drops, so their downsampling effect is inferior to the strided convolution of the original model. Introducing LAWD instead improves mAP50 by 0.4% and mAP50-95 by 0.3%, verifying its better downsampling capability, while the number of parameters, the computation, and the model size are all reduced to a certain extent, confirming that LAWD is lighter than strided convolution. Meanwhile, the detection speed decreases slightly, because the two-branch design of LAWD increases its actual computational complexity.
The accuracy of each target before and after the improvement is compared in Figure 11, from which we can see that glove and irrelevant personnel, the categories with lower detection accuracy, change most noticeably, by 1.0% and 0.9%, respectively, while the accuracy of the other three targets, which is already higher, does not change significantly. This indicates that with LAWD downsampling the model adaptively retains more features of the difficult-to-detect targets, balancing the model's performance across the various target categories to a certain extent and achieving a higher average detection accuracy, which verifies the better adaptability of LAWD compared with traditional downsampling.

4.6. Effect of Small Target Detection Layer and Lightweight Module on Model Performance

The experimental results are shown in Table 5. Adding the small-target detection layer yields the YOLOv8n-Small model; further replacing SPPF with SPPFCSPC-Light yields the YOLOv8n-Light1 model; replacing detect with Detect-Light instead of replacing SPPF yields the YOLOv8n-Light2 model; and introducing both SPPFCSPC-Light and Detect-Light into YOLOv8n-Small yields the YOLOv8n-S_Light model used in this paper. From Table 5 we can see that adding the extra small-target detection layer improves mAP50 by 1.2% and mAP50-95 by 1.3%, but because the shallow feature maps are larger, bringing them into the detection consumes more computational resources: the model's computation increases by 53% and the detection speed drops slightly, illustrating both the advantage and the cost of the additional small-target detection layer. Continuing to introduce the SPPFCSPC-Light module further improves detection accuracy while adding only a tiny amount of parameters and computation. Introducing Detect-Light instead of SPPFCSPC-Light slightly lowers accuracy compared to YOLOv8n-Small, but reduces the number of parameters by 10%, cuts computation drastically by 43%, and increases the detection speed by 30%, effectively reducing model complexity. Introducing both SPPFCSPC-Light and Detect-Light improves mAP50 by 1.4% and mAP50-95 by 2.2% over the original model while keeping the parameter count and computation low, and the detection speed also improves slightly, verifying the efficient detection performance of SPPFCSPC-Light and Detect-Light on our dataset.
The accuracy of each specific detection target before and after the improvement is compared in Figure 12, from which it can be seen that the accuracy of glove and badge, the targets with the smallest sizes, improves most noticeably, by 3.0% and 1.8%, respectively, verifying the effectiveness of the extra small-target detection layer for small-target detection. Meanwhile, the accuracy of the medium-sized safety hat improves by 0.7%, and the accuracy of worker and irrelevant personnel, the largest targets, also improves by 0.1% and 0.6%, respectively, verifying that the additional small-target detection layer does not weaken the model's ability to detect larger targets.

4.7. Ablation Experiment

The experimental results are shown in Table 6, where × denotes that the corresponding improvement is not applied to YOLOv8n and √ denotes that it is. When C2f-MSGC and LAWD are introduced together, the model's mAP50 improves by 1.1% and mAP50-95 by 1.0%, the number of parameters falls by 20%, computation by 7%, and the model size by 17%, with a slight reduction in detection speed. When S-Light is also introduced, the detection accuracy reaches its optimum, with mAP50 improved by 2.1% and mAP50-95 by 3.1%, and the model reaches its smallest volume: compared with the original model, the parameters are reduced by 29%, computation by 18%, and the model size is compressed by 23%. The detection speed decreases slightly but remains sufficient for real-time detection, verifying the reasonableness of fusing the multiple improvements proposed in this paper. The accuracy of each specific detection target before and after the improvement is compared in Figure 13.

4.8. Horizontal Comparison

As presented in Table 7, we conducted a comprehensive comparative analysis benchmarking ML-YOLOv8n-Light against the representative two-stage algorithm Faster R-CNN [33] and a selection of contemporary, mainstream YOLO series detection models. This evaluation was performed under identical experimental conditions, including the experimental setup, parameter configurations, datasets, and data preprocessing methods. As evidenced in Table 7, ML-YOLOv8n-Light outperforms all competing models in every metric except frames per second (FPS); notably, its FPS is superior to that of most models and only marginally lower than that of the original YOLOv8n.
It is important to articulate that while the detection speed does experience a slight reduction, this decrease remains within acceptable limits, particularly when considering the deployment on mobile devices with constrained computational resources. The performance of ML-YOLOv8n-Light in terms of detection speed, despite the reduction, is proven to be well within the threshold required for the real-time detection of safety protective equipment. Therefore, the slight compromise in FPS can be seen as a trade-off for achieving optimal performance across all other crucial metrics, one that is justifiable and manageable within the practical confines of real-world applications where accuracy and reliability are paramount.

4.9. Comparison of Actual Detection Results

Figure 14 shows the comparison results for three groups of actual detection scenarios, with YOLOv8n on the left and ML-YOLOv8n-Light on the right: the first group compares missed detections of small, distant targets, the second group compares missed detections under occlusion, and the last group compares false detections. From the first group of images, it can be seen that the original YOLOv8n easily misses the smaller badge and glove targets at a distance, while ML-YOLOv8n-Light clearly reduces such misses, verifying the effectiveness of its improvements for small-target detection. From the second group, YOLOv8n is prone to missing detections when the target is occluded, whereas ML-YOLOv8n-Light still detects partially occluded targets well, indicating that its detection of occluded targets is also superior to that of YOLOv8n. From the last group, YOLOv8n incorrectly detects a helmet and shoes as gloves, while ML-YOLOv8n-Light clearly reduces such false detections, verifying its detection accuracy and robustness.

4.10. Feature Map Visualization Comparison

In computer vision research, the visualization of feature maps is a key means of understanding and analyzing how a model processes and interprets image feature information. As shown in Figure 15, the visualization intuitively demonstrates the significant difference in how the model extracts and processes feature information before and after the improvement. The large picture on the left is the original image being detected; the small pictures on the right show the same channel of the feature maps at different levels during detection, transitioning from shallow feature maps on the left to deeper feature maps on the right. The upper four show the feature maps of YOLOv8n, and the lower four show those of ML-YOLOv8n-Light.
As can be seen in Figure 15, YOLOv8n only weakly attends to the workers across the feature-map levels and never forms a clear focus, indicating that its feature extraction is not sufficiently concentrated, which to some extent prevents it from adequately distinguishing the targets from the background. In contrast, ML-YOLOv8n-Light begins to attend to both the background information and the worker targets from the shallow levels and, as the layers deepen, progressively strengthens its focus on the worker targets while gradually ignoring the background. This indicates that ML-YOLOv8n-Light extracts features more efficiently, produces clearer and more focused feature representations, and locks onto task-relevant information more quickly, verifying its efficient allocation of computational resources, which reduces the model volume while improving detection accuracy.

5. Conclusions

In this paper, a multi-safety protective equipment detection model, ML-YOLOv8n-Light, was designed for electric power operation scenarios. First, a lightweight multi-scale convolutional block, MSGC, was designed to improve the model's adaptability to the wide span of protective equipment sizes; then, a lightweight adaptive weight downsampling (LAWD) mechanism was designed to overcome the limitations of traditional pooling and strided-convolution downsampling; finally, two lightweight modules were designed to improve detection efficiency, and a small-target detection layer was added to improve the detection of small targets. Experiments on the PSAD dataset show that, compared with the original model, ML-YOLOv8n-Light improves mAP50 by 2.1% and mAP50-95 by 3.1%, reduces the number of parameters by 29%, reduces computation by 18%, and compresses the model size by 23%; it also reduces long-distance missed detections, occlusion missed detections, and false detections, allocates computational resources more efficiently, and outperforms other mainstream detection models on a number of metrics. In the future, we will continue to expand and deepen the image library for safety protective equipment detection, adding more environmental conditions, different wearing styles, and diversified postures, and will label the data more professionally according to the needs of different detection scenarios (for example, for insulated gloves in electric power scenarios, a worker wearing both insulated gloves would be a positive sample, while wearing only one or none would be a negative sample) in order to increase the model's adaptability to real-world complexity.

Author Contributions

All authors contributed to the study conception and design. Material preparation, data collection and analysis were performed by X.L. and Y.L. The first draft of the manuscript was written by X.L. and all authors commented on previous versions of the manuscript. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Natural Science Foundation of China, grant number 61962031; the Yunnan Provincial Department of Science and Technology Major Science and Technology Special Program, grant number 202302AD080002 and the Yunnan Province Education Department Scientific Research Fund Project, grant number 2023J0146.

Data Availability Statement

The data in this paper are undisclosed due to the confidentiality requirements of the data supplier.

Acknowledgments

We thank Yunnan Electric Power Research Institute for collecting the construction pictures of electrical workers, which provided a solid foundation for the validation of the model proposed in this paper. We also thank the reviewers and editors for their constructive comments to improve the quality of this paper.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Zhou, Y.H. Demand Analysis of Individual Protective Equipment Allocation in the Electric Power Industry. CPPE 2023, 42–44. [Google Scholar]
  2. Li, X.; Wang, Y.; Qu, J.; Wang, W.; Xu, X.; Jin, Y.; Li, Y. Intelligent Monitoring Method of Helmet Wearing Identification in Power Plants. In Proceedings of the 2023 8th Asia Conference on Power and Electrical Engineering (ACPEE), Tianjin, China, 14–16 April 2023; pp. 1699–1703. [Google Scholar]
  3. Tan, M.; Pang, R.; Le, Q.V. EfficientDet: Scalable and Efficient Object Detection. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020. [Google Scholar] [CrossRef]
  4. Lin, T.Y.; Dollár, P.; Girshick, R.; He, K.; Hariharan, B.; Belongie, S. Feature Pyramid Networks for Object Detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 2117–2125. [Google Scholar]
  5. Woo, S.; Park, J.; Lee, J.Y.; Kweon, I.S. Cbam: Convolutional Block Attention Module. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 3–19. [Google Scholar]
  6. Feng, P.Y.; Qian, Y.R.; Fan, Y.Y.; Wei, H.Y.; Qin, Y.G.; Mo, W.H. Safety Helmet Detection Algorithm Based on Improved Cascade R-CNN. Microelectron. Comput. 2024, 41, 63–73. [Google Scholar] [CrossRef]
  7. Cai, Z.; Vasconcelos, N. Cascade R-CNN: Delving into High-Quality Object Detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–22 June 2018; pp. 6154–6162. [Google Scholar]
  8. Ma, J.; Chen, L.; Ren, B.; Zhu, S.X. A STDAB-DETR Model for Secure Detection of Electrical Construction Sites. In Proceedings of the 2023 13th International Conference on Power and Energy Systems (ICPES), Chengdu, China, 8–10 December 2023; pp. 207–212. [Google Scholar]
  9. Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin Transformer: Hierarchical Vision Transformer Using Shifted Windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Virtual, 11–17 October 2021; pp. 10012–10022. [Google Scholar]
  10. Girshick, R. Fast r-cnn. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015; pp. 1440–1448. [Google Scholar]
  11. Ru, W.; Daming, L.; Jian, Z. Wear-YOLO: A Study on Safety Equipment Detection Method for Power Personnel in Substation. Computer Engineering and Applications. 2024. Available online: http://kns.cnki.net/kcms/detail/11.2127.TP.20240104.0926.004.html (accessed on 28 April 2024).
  12. Wadekar, S.N.; Chaurasia, A. Mobilevitv3: Mobile-friendly Vision Transformer with Simple and Effective Fusion of Local, Global and Input Features. arXiv 2022, arXiv:2209.15159. [Google Scholar]
  13. Tong, Z.; Chen, Y.; Xu, Z.; Yu, R. Wise-IoU: Bounding Box Regression Loss with Dynamic Focusing Mechanism. arXiv 2023, arXiv:2301.10051. [Google Scholar]
  14. Ultralytics. YOLOv8. Available online: https://github.com/ultralytics/ultralytics (accessed on 11 August 2023).
  15. Ultralytics. YOLOv5 🚀 in PyTorch > ONNX > CoreML > TFLite. Available online: https://github.com/ultralytics/yolov5 (accessed on 11 August 2023).
  16. Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You Only Look Once: Unified, Real-time Object Detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 779–788. [Google Scholar]
  17. Redmon, J.; Farhadi, A. YOLO9000: Better, Faster, Stronger. In Proceedings of the IEEE Conference on Computer Vision & Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 6517–6525. [Google Scholar]
  18. Redmon, J.; Farhadi, A. YOLOv3: An Incremental Improvement. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–22 June 2018; pp. 101–104. [Google Scholar]
  19. Bochkovskiy, A.; Wang, C.Y.; Liao, H.Y.M. YOLOv4: Optimal Speed and Accuracy of Object Detection. arXiv 2020, arXiv:2004.10934. [Google Scholar]
  20. Ge, Z.; Liu, S.; Wang, F.; Li, Z.; Sun, J. YOLOX: Exceeding YOLO Series in 2021. arXiv 2021, arXiv:2107.08430. [Google Scholar]
  21. Li, C.Y.; Li, L.L.; Jiang, H.L.; Weng, K.H.; Geng, Y.F.; Li, L.; Ke, Z.D.; Li, Q.Y.; Cheng, M.; Nie, W.Q.; et al. YOLOv6: A Single-stage Object Detection Framework for Industrial Applications. arXiv 2022, arXiv:2209.02976. [Google Scholar]
  22. Wang, C.Y.; Bochkovskiy, A.; Liao, H.Y.M. YOLOv7: Trainable Bag-of-freebies Sets New State-of-the-Art for Real-Time Object Detectors. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 7464–7475. [Google Scholar]
  23. He, K.; Zhang, X.; Ren, S.; Sun, J. Spatial Pyramid Pooling in Deep Convolutional Networks for Visual Recognition. IEEE Trans. Pattern Anal. Mach. Intell. 2014, 37, 1904–1916. [Google Scholar] [CrossRef] [PubMed]
  24. Feng, C.; Zhong, Y.; Gao, Y.; Scott, M.R.; Huang, W. TOOD: Task-aligned One-stage Object Detection. In Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision (ICCV), Virtual, 11–17 October 2021; pp. 3490–3499. [Google Scholar]
  25. Lin, T.Y.; Goyal, P.; Girshick, R.; He, K.M.; Dollar, P. Focal Loss for Dense Object Detection. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2980–2988. [Google Scholar]
  26. Zheng, Z.; Wang, P.; Ren, D.; Liu, W.; Ye, R.; Hu, Q.; Zuo, W. Enhancing Geometric Factors in Model Learning and Inference for Object Detection and Instance Segmentation. IEEE Trans. Cybern. 2020, 52, 8574–8586. [Google Scholar] [CrossRef] [PubMed]
  27. Tan, M.; Le, Q.V. MixConv: Mixed Depthwise Convolutional Kernels. arXiv 2019, arXiv:1907.09595. [Google Scholar]
  28. Hesse, R.; Schaub-Meyer, S.; Roth, S. Content-Adaptive Downsampling in Convolutional Neural Networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 4543–4552. [Google Scholar]
  29. He, L.; Wang, M. SliceSamp: A Promising Downsampling Alternative for Retaining Information in a Neural Network. Appl. Sci. 2023, 13, 11657. [Google Scholar] [CrossRef]
  30. Figurnov, M.; Collins, M.D.; Zhu, Y.; Zhang, L.; Huang, J.; Vetrov, D.; Salakhutdinov, R. Spatially Adaptive Computation Time for Residual Networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 1039–1048. [Google Scholar]
  31. Son, S.; Kim, J.; Lai, W.-S.; Yang, M.-H.; Lee, K.M. Toward Real-World Super-Resolution via Adaptive Downsampling Models. IEEE Trans. Pattern Anal. Mach. Intell. 2021, 44, 8657–8670. [Google Scholar] [CrossRef] [PubMed]
  32. Chen, J.; Kao, S.; He, H.; Zhuo, W.; Wen, S.; Lee, C.; Chan, S.G. Run, Don’t Walk: Chasing Higher FLOPS for Faster Neural Networks. In Proceedings of the 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 17–24 June 2023; pp. 12021–12031. [Google Scholar]
  33. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. In Proceedings of the Annual Conference on Advances in Neural Information Processing Systems, Montreal, QC, Canada, 7–12 December 2015; Volume 28. [Google Scholar]
Figure 1. YOLOv8 network structure.
Figure 2. Dataset label size statistics.
Figure 3. MSGC-bottleneck structure diagram.
Figure 4. LAWD structure.
Figure 5. SPPFCSPC structure.
Figure 6. PConv structure.
Figure 7. Detect-Light structure.
Figure 8. ML-YOLOv8n-Light network structure.
Figure 9. Annotation diagram.
Figure 10. Accuracy comparison of different detection targets after introducing MSGC.
Figure 11. Accuracy comparison of different detection targets after introducing LAWD.
Figure 12. Comparison of accuracy of different detection targets.
Figure 13. Comparison of accuracy of different detection targets before and after improvement.
Figure 14. Comparison of actual detection effect.
Figure 15. Feature map visualization comparison.
Table 1. Experimental environment.

Category | Name | Specific Information
Hardware configuration | Processor | Intel(R) Core(TM) i5-12490F
Hardware configuration | Graphics card | NVIDIA RTX 3070
Hardware configuration | Memory | 32 GB
Hardware configuration | Video memory | 8 GB
Software environment | Operating system | Windows 10
Software environment | Python | 3.8
Software environment | PyTorch | 2.0.0
Software environment | CUDA | 12.2
Software environment | cuDNN | 8.7.0
Table 3. Comparison of experimental results with MSGC.

Model | mAP50/% | mAP50-95/% | Params/M | GFLOPs | FPS | Model-Size/MB
YOLOv8n | 90.1 | 65.5 | 3.01 | 8.1 | 112 | 5.93
YOLOv8n+C2f_MixConv(1,3) | 90.3 | 65.7 | 2.66 | 7.2 | 110 | 5.61
YOLOv8n+C2f_MixConv(3,5) | 90.2 | 65.4 | 2.98 | 8.0 | 108 | 5.89
YOLOv8n+C2f_MixConv(5,7) | 90.5 | 65.8 | 3.51 | 9.3 | 95 | 6.88
YOLOv8n+C2f_MixConv(1,3,5,7) | 90.7 | 66.0 | 2.81 | 7.6 | 107 | 5.72
YOLOv8n+C2f_MSGC | 91.0 | 66.2 | 2.86 | 7.7 | 105 | 5.76
Table 4. Comparison of experimental results with LAWD.

Model | mAP50/% | mAP50-95/% | Params/M | GFLOPs | FPS | Model-Size/MB
YOLOv8n | 90.1 | 65.6 | 3.01 | 8.1 | 112 | 5.93
YOLOv8n+Avgpool | 89.3 | 64.8 | 2.39 | 7.1 | 124 | 4.92
YOLOv8n+GroupConv | 89.6 | 65.1 | 2.42 | 7.1 | 116 | 4.97
YOLOv8n+LAWD | 90.5 | 65.9 | 2.51 | 7.3 | 108 | 5.11
Table 5. Comparison of experimental results with the improved network structure.

Model | mAP50/% | mAP50-95/% | Params/M | GFLOPs | FPS | Model-Size/MB
YOLOv8n | 90.1 | 65.6 | 3.01 | 8.1 | 112 | 5.93
YOLOv8n-Small | 91.3 | 67.2 | 2.93 | 12.4 | 96 | 5.89
YOLOv8n-Light1 | 91.8 | 68.0 | 3.05 | 12.5 | 92 | 6.02
YOLOv8n-Light2 | 91.2 | 67.0 | 2.63 | 7.1 | 125 | 5.29
YOLOv8n-S_Light | 91.5 | 67.8 | 2.75 | 7.2 | 117 | 5.41
Table 6. Results of ablation experiments.

Model | C2f-MSGC | LAWD | S-Light | mAP50/% | mAP50-95/% | Params/M | GFLOPs | FPS | Model-Size/MB
YOLOv8n | × | × | × | 90.1 | 65.6 | 3.01 | 8.1 | 112 | 5.93
ML-YOLOv8n | √ | √ | × | 91.2 | 66.9 | 2.42 | 7.5 | 103 | 4.92
ML-YOLOv8n-Light | √ | √ | √ | 92.2 | 68.7 | 2.15 | 6.6 | 108 | 4.54
Table 7. Horizontal comparison of experimental results.

Model | mAP50/% | mAP50-95/% | Params/M | GFLOPs | FPS | Model-Size/MB
Faster R-CNN | 81.6 | 57.9 | 41.13 | 206.6 | 41 | 367.64
YOLOv5s | 90.7 | 65.1 | 7.02 | 15.8 | 89 | 13.70
YOLOX-S | 91.1 | 66.2 | 8.94 | 26.8 | 72 | 38.70
YOLOv5n-6.0 | 90.3 | 64.9 | 2.50 | 7.1 | 104 | 5.00
YOLOv6n | 89.8 | 64.5 | 4.23 | 11.8 | 98 | 8.26
YOLOv7-tiny | 91.5 | 65.7 | 6.01 | 13.0 | 91 | 11.47
YOLOv8n | 90.6 | 65.9 | 3.01 | 8.1 | 112 | 5.93
ML-YOLOv8n-Light | 92.2 | 68.7 | 2.15 | 6.6 | 106 | 4.54
Table 2. Parameter settings.

Parameter Name | Parameter Setting | Parameter Name | Parameter Setting
epochs | 200 | batch | 16
imgsz | 640 | workers | 4
optimizer | SGD | close_mosaic | 10
train_iou | 0.7 | val_iou | 0.65
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
