1. Introduction
Traditional inspection methods, such as manual patrols and fixed monitoring devices, have clear limitations: because they rely primarily on manual operation, they suffer from low efficiency and high risk. Intelligent inspection robots have been proposed in response to these challenges and are now widely deployed across domains including daily life safety [1], robot navigation [2], intelligent video surveillance [3], traffic scene detection [4], and aerospace [5]. For these robots to function effectively, for instance when opening substation doors, selecting a suitable object detection algorithm for identifying doors and handles is crucial. In electrical substations, poor lighting conditions and large distances to detection targets can cause conventional object detection models to miss targets or raise false detections, compromising detection outcomes. Consequently, enhancing the real-time feature extraction capability of object detection for critical targets under such complex environmental conditions is particularly important.
Object detection algorithms for inspection robots are based on deep learning. By training convolutional neural networks (CNNs) [6], they automatically learn feature representations from images, which allows them to handle complex and variable target morphologies and background environments with robust feature representation and strong generalization. Single-stage detectors such as YOLO [7] and SSD [8] achieve fast detection by placing numerous predefined anchor boxes directly on the image and predicting their categories and locations. Two-stage detectors such as Faster R-CNN [9] first generate candidate regions via a region proposal network (RPN) and then perform classification and bounding box regression on each candidate region, achieving higher-precision detection. A comparative analysis is presented in Table 1.
To ensure real-time performance, substation inspection robots employ single-stage detection architectures. In recent years, deep learning-based object detection has advanced significantly in both research and applications. Researchers have made extensive improvements to the YOLO series, including optimizations of feature extraction networks, region of interest (RoI) pooling layers, region proposal networks (RPNs), and non-maximum suppression (NMS) modules. These enhancements have continually raised detection efficiency and accuracy, driving further development of related technologies. For instance, Wu et al. [10] proposed an Asymptotic Enhancement Network (AENet) integrated with YOLOv3, forming a framework named AE-YOLO designed for low-light conditions; to maximize the benefit of the enhancement network for downstream detection, AENet adaptively enhances images at both the pixel level and the feature level. Zhang et al. [11] introduced a Low-light Image Enhancement Network (LIEN) that adaptively adjusts illumination and can be trained end to end within YOLOv8, avoiding the inconsistency that arises when an enhancement module and an instance segmentation algorithm are trained separately, where improved image quality does not necessarily translate into better segmentation. Han et al. [12] presented 3L-YOLO, a YOLOv8n-based detection method that operates without an image enhancement module. Zhang et al. [13] developed LLD-YOLO, an enhanced YOLOv11 for low-light vehicle detection that incorporates DarkNet modules modified with self-calibrated illumination learning for adaptive low-light image enhancement, together with a C3k2-RA feature extraction enhancement module. Finally, Xiao et al. [14] proposed DHSW-YOLO (DH-SENet-WIoU-YOLO), a deep learning model for real-time detection of daily behaviors in White Muscovy Duck (WMD) flocks under varying lighting conditions that meets stringent real-time speed requirements.
Despite significant advancements in YOLO-based object detection (a single-stage approach), it still exhibits suboptimal performance in small object detection. This limitation primarily stems from the low-resolution feature maps generated by YOLO’s network architecture, leading to inadequate feature extraction for small targets. Compared to region proposal-based methods, YOLO achieves lower recall rates, indicating higher susceptibility to missed detections and false detections. These inherent constraints restrict YOLO’s effectiveness in specific application scenarios, necessitating scenario-specific selection and optimization during practical deployment.
To address the aforementioned challenges while leveraging YOLO's strengths, this study builds on existing research and derives optimized improvements for low-light environments through extensive experimentation and analysis. The proposed framework builds upon YOLOv11. To enhance the performance of the deep convolutional network on object detection tasks, we introduce three key modifications: replacing standard Conv layers with HetConv [15] in the backbone, incorporating ShuffleAttention (SA) [16] modules, and introducing Inner-SIoU to improve feature extraction for small objects, thereby mitigating missed and false detections. In addition, we substitute Conv layers with HetConv in the neck to make the model more lightweight. Comparative and ablation experiments conducted on both normal-light and low-light datasets demonstrate that our HSS-YOLO outperforms the YOLOv11 baseline. The subsequent sections detail the model's implementation and optimization strategies.
3. Experimental Research and Analysis
This section explains the model testing process and evaluation metrics in detail. Comparative experiments demonstrate the model's performance on object detection tasks, and ablation studies compare the individual modules to validate the effectiveness of the proposed model.
3.1. Dataset
Model performance directly affects its effectiveness and reliability in practical applications, so testing the proposed model on datasets is an essential step in evaluating its validity and reliability. Testing on datasets allows an objective and comprehensive assessment of the model's ability to detect targets in complex scenarios and enables comparison with existing methods. This section introduces the datasets, training configurations, and evaluation metrics used for the model.
In training deep learning models, selecting appropriate datasets is crucial: their quality, size, and diversity directly affect training effectiveness, generalization ability, and practical performance. To evaluate HSS-YOLO's object detection capability in real dim environments, this study used two datasets: a custom dataset and a public dataset. To improve recognition in scenes beyond those in the custom dataset, additional scenes from the public dataset were included. The public dataset is the DoorDetect Dataset [17].
The custom dataset (HandleData) consists of 6.75k images covering three basic target categories: door, handle, and knob. It was split into training, validation, and test sets at a ratio of 8:1:1, with detailed label information shown in Figure 7. Training was run for 250 epochs to achieve better model convergence, and all images used in this experiment were uniformly resized to 640 × 640.
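As an illustration of the 8:1:1 split described above, the following sketch partitions a folder of images into training, validation, and test lists; the directory name and file extension are hypothetical placeholders, not the actual layout of HandleData.

```python
import random
from pathlib import Path

# Illustrative 8:1:1 split; "HandleData/images" is a placeholder path.
images = sorted(Path("HandleData/images").glob("*.jpg"))
random.seed(0)
random.shuffle(images)

n = len(images)
n_train, n_val = int(0.8 * n), int(0.1 * n)
splits = {
    "train": images[:n_train],
    "val": images[n_train:n_train + n_val],
    "test": images[n_train + n_val:],
}
# Write one file list per split, as commonly consumed by YOLO-style trainers.
for name, files in splits.items():
    Path(f"{name}.txt").write_text("\n".join(str(f) for f in files))
```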
During training, selecting appropriate training parameters and hardware configurations is crucial for model convergence, generalization, and training efficiency. Incorrect parameter settings may result in poor model performance and negatively affect the model’s effectiveness in real-world tasks.
The model validation environment used an Intel(R) Core(TM) i7-13650HX CPU (Intel Corporation) running at 4.90 GHz with 32 GB of RAM, an NVIDIA GeForce RTX 4060 GPU (NVIDIA) with 8 GB of video memory, and the Windows 11 Professional operating system, together with Python 3.10.15 and the PyTorch 2.0.0 deep learning framework. The specific hardware (Table 2) and software parameters (Table 3) are as follows:
The relevant training parameter settings are listed below (default values are used for all other parameters):
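As an illustration of such a configuration, a training run could be launched with the Ultralytics training interface roughly as sketched below; the model and dataset YAML file names are hypothetical, and only the epoch count and input resolution are taken from this study.

```python
from ultralytics import YOLO

# Hypothetical file names; the actual HSS-YOLO model definition and dataset
# description are not reproduced here.
model = YOLO("hss-yolo.yaml")
model.train(
    data="handledata.yaml",   # dataset description (placeholder)
    epochs=250,               # epoch count used in this study
    imgsz=640,                # 640 x 640 input resolution used in this study
    device=0,                 # single GPU (RTX 4060 in the reported setup)
)
```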
3.2. Model Experiments and Analysis
3.2.1. Comparative Experiments
Comparative experiments systematically compare different models under the same conditions and provide strong support for both scientific research and practical decision-making. The generalization and robustness of HSS-YOLO were validated using the two datasets.
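For reference, the precision (P), recall (R), and mAP@0.5 values reported below follow the standard definitions, where TP, FP, and FN denote true positives, false positives, and false negatives, and N is the number of classes:

```latex
P = \frac{TP}{TP + FP}, \qquad
R = \frac{TP}{TP + FN}, \qquad
AP = \int_{0}^{1} P(R)\,\mathrm{d}R, \qquad
\mathrm{mAP@0.5} = \frac{1}{N}\sum_{i=1}^{N} AP_i \quad (\text{IoU threshold } 0.5)
```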
As shown in Table 4, HSS-YOLO achieved improved performance compared with YOLO11 on both datasets. On the custom dataset, precision increased by 4.9%, recall by 2.7%, and mAP@0.5 by 4%; on the DoorDetect Dataset, precision increased by 1.3%, recall by 2.9%, and mAP@0.5 by 1.3%. These comparisons indicate that HSS-YOLO outperforms YOLO11.
3.2.2. Ablation Experiments
This ablation experiment verifies the effects of HetConv, ShuffleAttention, and Inner-SIoU on two different datasets.
As shown in Table 5, compared with the baseline YOLO11, HSS-YOLO improved both the accuracy and the stability of object detection. First, HSS-YOLO reduced parameters by 10.7% and GFLOPs by 9%, making the model more lightweight and easier to deploy and run on resource-limited devices such as inspection robots. Second, it improved precision by 4.6%, recall by 2.8%, and mAP@0.5 by 4%, markedly increasing the detection of small targets and key targets in complex scenes, which shows that the model detects key targets in dim environments with high accuracy and stability. The frame rate reached 92 FPS, meeting real-time detection requirements.
YOLO11 + HetConv uses the lightweight HetConv, which requires at least 40% fewer parameters than the original Conv. As shown in Table 5, introducing the HetConv module not only reduced the model's parameters and computation but also improved precision by 2.5%, strengthening the detection of critical targets.
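For illustration, the sketch below shows a commonly used PyTorch approximation of a HetConv layer, in which a grouped 3 × 3 convolution covers a 1/p fraction of the input channels per filter and a pointwise 1 × 1 convolution covers the rest. This is a generic sketch of the published HetConv idea, not the exact layer implemented in HSS-YOLO.

```python
import torch.nn as nn

class HetConv(nn.Module):
    """Approximate HetConv: per filter, 3x3 kernels act on a 1/p fraction of
    the input channels (grouped conv) and 1x1 kernels act on the remainder."""
    def __init__(self, in_channels, out_channels, p=4, stride=1):
        super().__init__()
        # in_channels and out_channels are assumed divisible by p.
        self.conv3x3 = nn.Conv2d(in_channels, out_channels, 3, stride=stride,
                                 padding=1, groups=p, bias=False)
        self.conv1x1 = nn.Conv2d(in_channels, out_channels, 1, stride=stride,
                                 bias=False)
        self.bn = nn.BatchNorm2d(out_channels)
        self.act = nn.SiLU()

    def forward(self, x):
        # Sum the heterogeneous branches, then normalise and activate.
        return self.act(self.bn(self.conv3x3(x) + self.conv1x1(x)))
```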
YOLO11 + ShuffleAttention adds a ShuffleAttention layer before the C2PSA module of the backbone network. This attention mechanism improves the deep convolutional network's detection of key targets by making the network attend to the important features in the image while suppressing irrelevant information.
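A minimal PyTorch sketch of such a ShuffleAttention block, following the published SA design [16] (grouped channel and spatial branches recombined by a channel shuffle), is given below; the number of groups and other parameter choices are illustrative rather than those used in HSS-YOLO.

```python
import torch
import torch.nn as nn

class ShuffleAttention(nn.Module):
    """Sketch of Shuffle Attention: channels are split into groups, each group
    is divided into a channel-attention branch and a spatial-attention branch,
    and the results are recombined with a channel shuffle."""
    def __init__(self, channels, groups=64):
        super().__init__()
        self.groups = groups
        c = channels // (2 * groups)          # channels per branch; channels must divide by 2*groups
        self.avg_pool = nn.AdaptiveAvgPool2d(1)
        self.cweight = nn.Parameter(torch.zeros(1, c, 1, 1))
        self.cbias = nn.Parameter(torch.ones(1, c, 1, 1))
        self.sweight = nn.Parameter(torch.zeros(1, c, 1, 1))
        self.sbias = nn.Parameter(torch.ones(1, c, 1, 1))
        self.gn = nn.GroupNorm(c, c)
        self.sigmoid = nn.Sigmoid()

    @staticmethod
    def channel_shuffle(x, groups):
        b, c, h, w = x.shape
        x = x.view(b, groups, c // groups, h, w).transpose(1, 2).contiguous()
        return x.view(b, c, h, w)

    def forward(self, x):
        b, c, h, w = x.shape
        x = x.view(b * self.groups, c // self.groups, h, w)
        x_ch, x_sp = x.chunk(2, dim=1)        # channel / spatial branches
        # Channel attention: global context produces a per-channel gate.
        gate_c = self.sigmoid(self.cweight * self.avg_pool(x_ch) + self.cbias)
        x_ch = x_ch * gate_c
        # Spatial attention: group-normalised features produce a per-location gate.
        gate_s = self.sigmoid(self.sweight * self.gn(x_sp) + self.sbias)
        x_sp = x_sp * gate_s
        out = torch.cat([x_ch, x_sp], dim=1).view(b, c, h, w)
        return self.channel_shuffle(out, 2)
```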
YOLO11 + Inner-SIoU improved precision by 3.8% and mAP@0.5 by 4.3%. It retains the basic properties of IoU while evaluating overlapping bounding boxes more accurately by computing the overlap between auxiliary inner boxes, giving the model finer evaluation capability and more accurate recognition and localization of targets in complex scenes.
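To illustrate the idea, the sketch below computes the inner IoU between auxiliary boxes obtained by rescaling the original boxes about their centres; in an Inner-SIoU loss this term takes the place of the plain IoU term of SIoU, whose angle, distance, and shape costs are omitted here for brevity. The ratio value is illustrative, not the one used in HSS-YOLO.

```python
import torch

def inner_iou(box1, box2, ratio=0.75, eps=1e-7):
    """IoU between auxiliary 'inner' boxes obtained by rescaling the originals
    about their centres (ratio < 1 shrinks, ratio > 1 enlarges).
    box1, box2: (N, 4) tensors in (x1, y1, x2, y2) format."""
    def rescale(b):
        cx, cy = (b[:, 0] + b[:, 2]) / 2, (b[:, 1] + b[:, 3]) / 2
        w, h = (b[:, 2] - b[:, 0]) * ratio, (b[:, 3] - b[:, 1]) * ratio
        return cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2

    ax1, ay1, ax2, ay2 = rescale(box1)
    bx1, by1, bx2, by2 = rescale(box2)
    # Intersection and union of the rescaled (inner) boxes.
    inter = (torch.min(ax2, bx2) - torch.max(ax1, bx1)).clamp(0) * \
            (torch.min(ay2, by2) - torch.max(ay1, by1)).clamp(0)
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter + eps
    return inter / union
```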
Finally, HSS-YOLO achieved the best results in this experiment, with improvements of 4.9% in precision, 2.8% in recall, and 4% in mAP@0.5, indicating higher overall accuracy.
Table 6 shows that HSS-YOLO maintains the same advantages in ablation experiments on the public dataset as on the custom dataset, improving its detection capability.
Overall, the experimental and data analyses of each module demonstrated that HSS-YOLO greatly strengthened detection ability for key targets. Subsequent experiments also proved that the model had strong recognition capability in dim environments of power distribution rooms, enhancing the model’s generalization and accuracy.
3.2.3. Comparison with Other Versions
This experiment employed a custom-built dataset to conduct a comparative analysis between HSS-YOLO and other versions of the YOLO architecture.
The final experimental results in Table 7 demonstrate that HSS-YOLO exhibited superior performance on this specific dataset compared to the alternative versions.
3.3. Experiments and Analysis
3.3.1. Experimental Environment
To ensure accuracy and authenticity, all the experiments were conducted under the same environment.
3.3.2. Comparison of Object Detection Inference Images
HSS-YOLO and YOLO11 were compared on object detection using test images from different scenarios. The specific experimental comparison images are shown below:
Figure 8 shows an operating scenario encountered by the inspection robot. Under dim lighting and interfering light sources, YOLO11 exhibited missed detections, whereas HSS-YOLO detected these targets and achieved higher confidence.
Figure 9 compares key target recognition in an outdoor dim area. The comparison between HSS-YOLO and YOLO11 shows that HSS-YOLO markedly improved the detection of small targets such as door handles, demonstrating a clear improvement in key target detection under dim conditions.
Figure 10 and Figure 11 display small target detection in brighter areas. The comparison clearly shows missed detections by YOLO11, which HSS-YOLO recovered with high-confidence predictions, representing a substantial improvement over YOLO11.
Figure 12 also shows a significant improvement in door handle recognition in the same indoor environment.
3.3.3. Heatmap Evaluation
Gradient-weighted Class Activation Mapping (Grad-CAM) visualizes heatmaps that help interpret the decisions of convolutional neural networks. The heatmaps can be upsampled and overlaid on the original images to highlight the regions the model focuses on most when making a prediction. An advantage of Grad-CAM is that it can be applied to any convolutional neural network without structural modifications or retraining, providing a simple yet intuitive way to understand the model's decision for a given input. In these visualizations, the deeper and more intense the red hue in a region, the stronger the model's attention to that region.
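A minimal hook-based sketch of Grad-CAM is shown below for a network whose output is a vector of class scores; applying it to a full detector requires selecting the score of a specific detection, which is omitted here, and the choice of target layer is an assumption.

```python
import torch

def grad_cam(model, layer, image):
    """Minimal Grad-CAM sketch: hook a convolutional layer, back-propagate the
    top score, and weight the activations by channel-averaged gradients."""
    acts, grads = [], []
    h1 = layer.register_forward_hook(lambda m, i, o: acts.append(o))
    h2 = layer.register_full_backward_hook(lambda m, gi, go: grads.append(go[0]))

    score = model(image).max()      # score of the strongest prediction (classifier-style output assumed)
    model.zero_grad()
    score.backward()
    h1.remove(); h2.remove()

    a, g = acts[0], grads[0]                          # (1, C, H, W)
    weights = g.mean(dim=(2, 3), keepdim=True)        # global-average-pooled gradients
    cam = torch.relu((weights * a).sum(dim=1))        # weighted sum over channels
    cam = (cam - cam.min()) / (cam.max() - cam.min() + 1e-7)
    return cam   # upsample and overlay on the input image for display
```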
Comparison of Grad-CAM heatmaps between YOLO11 and HSS-YOLO:
As introduced above, a deeper and more intense red hue indicates stronger attention to that region. Comparing the heatmaps in Figure 13, Figure 14 and Figure 15, HSS-YOLO produced deeper colors than YOLO11 on the key targets under the same experimental scenarios, indicating that HSS-YOLO attends to the key targets more strongly and detects them more accurately.
3.4. Model Evaluation Summary
Darker environments place higher demands on the accuracy and stability of the model. The experimental comparisons above, together with the analyses of object recognition images and heatmaps under the same scenarios, show that HSS-YOLO outperforms YOLO11 in both dim and other environments in terms of object recognition and heatmap quality. Compared with YOLO11, HSS-YOLO therefore demonstrates better accuracy and stability.