1. Introduction
Traditional inspection methods, such as manual patrols and fixed monitoring devices, have clear limitations: because they rely primarily on manual operation, they suffer from low efficiency and high risk. Intelligent inspection robots have been proposed in response to these challenges and are now widely deployed across domains including daily life safety [1], robot navigation [2], intelligent video surveillance [3], traffic scene detection [4], and aerospace [5]. For these robots to function effectively, for instance when opening substation doors, selecting a suitable object detection algorithm for identifying doors and handles is crucial. In electrical substations, poor lighting conditions and large distances to detection targets can cause conventional object detection models to miss targets or raise false detections, compromising detection outcomes. Consequently, enhancing the real-time feature extraction capability of object detection for critical targets under such complex environmental conditions is particularly important.
Object detection algorithms for inspection robots are based on deep learning. By training convolutional neural networks (CNNs) [6], they automatically learn feature representations from images, which allows them to handle complex and variable target morphologies and background environments with robust feature representation and strong generalization. Single-stage detectors such as YOLO [7] and SSD [8] achieve fast detection by placing numerous predefined anchor boxes directly on the image and predicting their categories and locations. Two-stage detectors such as Faster R-CNN [9] first generate candidate regions via a region proposal network (RPN) and then perform classification and bounding box regression on each candidate region, achieving higher-precision detection. A comparative analysis is presented in Table 1.
To ensure real-time performance, substation inspection robots employ single-stage detection architectures. In recent years, deep learning-based object detection has advanced significantly in both research and applications. Researchers have made extensive improvements to the YOLO series, including optimizations of feature extraction networks, region of interest (RoI) pooling layers, region proposal networks (RPNs), and non-maximum suppression (NMS) modules. These enhancements have continually raised detection efficiency and accuracy, driving further development of related technologies. For instance, Wu et al. [10] proposed an Asymptotic Enhancement Network (AENet) integrated with YOLOv3, forming a framework named AE-YOLO designed for low-light conditions; to maximize the benefit of the enhancement network for downstream detection, AENet adaptively enhances images at both the pixel level and the feature level. Zhang et al. [11] introduced a Low-light Image Enhancement Network (LIEN) that adaptively adjusts illumination and can be trained end to end within YOLOv8, avoiding the inconsistency that arises when an enhancement module and an instance segmentation algorithm are trained separately, where improved image quality does not necessarily translate into better segmentation. Han et al. [12] presented 3L-YOLO, a YOLOv8n-based detection method that operates without an image enhancement module. Zhang et al. [13] developed LLD-YOLO, an enhanced YOLOv11 for low-light vehicle detection that incorporates DarkNet modules modified with self-calibrated illumination learning for adaptive low-light image enhancement, together with a C3k2-RA feature extraction enhancement module. Finally, Xiao et al. [14] proposed DHSW-YOLO (DH-SENet-WIoU-YOLO), a deep learning model for real-time detection of daily behaviors in White Muscovy Duck (WMD) flocks under varying lighting conditions that meets stringent real-time speed requirements.
Despite significant advancements in YOLO-based object detection (a single-stage approach), it still exhibits suboptimal performance in small object detection. This limitation primarily stems from the low-resolution feature maps generated by YOLO’s network architecture, leading to inadequate feature extraction for small targets. Compared to region proposal-based methods, YOLO achieves lower recall rates, indicating higher susceptibility to missed detections and false detections. These inherent constraints restrict YOLO’s effectiveness in specific application scenarios, necessitating scenario-specific selection and optimization during practical deployment.
To address the aforementioned challenges while leveraging YOLO's strengths, this study builds on existing research and derives optimized improvements for low-light environments through extensive experimentation and analysis. The proposed framework builds upon YOLOv11. To enhance the performance of the deep convolutional network on object detection tasks, we introduce three key modifications: replacing standard Conv layers with HetConv [15] in the backbone, incorporating ShuffleAttention (SA) [16] modules, and introducing Inner-SIoU to improve feature extraction for small objects, thereby mitigating missed and false detections. In addition, we substitute Conv layers with HetConv in the neck to make the model more lightweight. Comparative and ablation experiments conducted on both normal-light and low-light datasets demonstrate that our HSS-YOLO outperforms the YOLOv11 baseline. The subsequent sections detail the model's implementation and optimization strategies.
3. Experimental Research and Analysis
This section explains the model testing process and evaluation metrics in detail. Comparative experiments demonstrate the model's performance on object detection tasks, and ablation studies compare the individual modules to validate the effectiveness of the proposed model.
3.1. Dataset
Model performance directly affects its effectiveness and reliability in practical applications, so testing the proposed model on datasets is an essential step in evaluating its validity and reliability. Testing on datasets allows an objective and comprehensive assessment of the model's ability to detect targets in complex scenarios and enables comparison with existing methods. This section introduces the datasets, training configurations, and evaluation metrics used for the model.
In training deep learning models, selecting appropriate datasets is crucial: their quality, size, and diversity directly affect training effectiveness, generalization ability, and practical performance. To evaluate HSS-YOLO's object detection capability in real dim environments, this study used two datasets: a custom dataset and a public dataset. To improve recognition in scenes beyond those in the custom dataset, additional scenes from the public dataset were included. The public dataset is the DoorDetect Dataset [17].
The custom dataset (HandleData) consists of 6.75k images covering three basic target categories: door, handle, and knob. It was split into training, validation, and test sets at a ratio of 8:1:1, with detailed label information shown in Figure 7. Training was run for 250 epochs to achieve better model convergence, and all images used in this experiment were uniformly resized to 640 × 640.
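As an illustration of the 8:1:1 split described above, the following sketch partitions a folder of images into training, validation, and test lists; the directory name and file extension are hypothetical placeholders, not the actual layout of HandleData.

```python
import random
from pathlib import Path

# Illustrative 8:1:1 split; "HandleData/images" is a placeholder path.
images = sorted(Path("HandleData/images").glob("*.jpg"))
random.seed(0)
random.shuffle(images)

n = len(images)
n_train, n_val = int(0.8 * n), int(0.1 * n)
splits = {
    "train": images[:n_train],
    "val": images[n_train:n_train + n_val],
    "test": images[n_train + n_val:],
}
# Write one file list per split, as commonly consumed by YOLO-style trainers.
for name, files in splits.items():
    Path(f"{name}.txt").write_text("\n".join(str(f) for f in files))
```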
During training, selecting appropriate training parameters and hardware configurations is crucial for model convergence, generalization, and training efficiency. Incorrect parameter settings may result in poor model performance and negatively affect the model’s effectiveness in real-world tasks.
The model validation environment used an Intel(R) Core(TM) i7-13650HX CPU (Intel Corporation) running at 4.90 GHz with 32 GB of RAM, an NVIDIA GeForce RTX 4060 GPU (NVIDIA) with 8 GB of video memory, and the Windows 11 Professional operating system, together with Python 3.10.15 and the PyTorch 2.0.0 deep learning framework. The specific hardware (Table 2) and software parameters (Table 3) are as follows:
The relevant training parameter settings are listed below (default values are used for all other parameters):
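As an illustration of such a configuration, a training run could be launched with the Ultralytics training interface roughly as sketched below; the model and dataset YAML file names are hypothetical, and only the epoch count and input resolution are taken from this study.

```python
from ultralytics import YOLO

# Hypothetical file names; the actual HSS-YOLO model definition and dataset
# description are not reproduced here.
model = YOLO("hss-yolo.yaml")
model.train(
    data="handledata.yaml",   # dataset description (placeholder)
    epochs=250,               # epoch count used in this study
    imgsz=640,                # 640 x 640 input resolution used in this study
    device=0,                 # single GPU (RTX 4060 in the reported setup)
)
```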
3.2. Model Experiments and Analysis
3.2.1. Comparative Experiments
Comparative experiments systematically compare different models under the same conditions and provide strong support for both scientific research and practical decision-making. The generalization and robustness of HSS-YOLO were validated using the two datasets.
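For reference, the precision (P), recall (R), and mAP@0.5 values reported below follow the standard definitions, where TP, FP, and FN denote true positives, false positives, and false negatives, and N is the number of classes:

```latex
P = \frac{TP}{TP + FP}, \qquad
R = \frac{TP}{TP + FN}, \qquad
AP = \int_{0}^{1} P(R)\,\mathrm{d}R, \qquad
\mathrm{mAP@0.5} = \frac{1}{N}\sum_{i=1}^{N} AP_i \quad (\text{IoU threshold } 0.5)
```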
As shown in Table 4, HSS-YOLO achieved improved performance compared with YOLO11 on both datasets. On the custom dataset, precision increased by 4.9%, recall by 2.7%, and mAP@0.5 by 4%; on the DoorDetect Dataset, precision increased by 1.3%, recall by 2.9%, and mAP@0.5 by 1.3%. These comparisons indicate that HSS-YOLO outperforms YOLO11.
3.2.2. Ablation Experiments
This ablation experiment verifies the effects of HetConv, ShuffleAttention, and Inner-SIoU on two different datasets.
As shown in Table 5, compared with the baseline YOLO11, HSS-YOLO improved both the accuracy and the stability of object detection. First, HSS-YOLO reduced parameters by 10.7% and GFLOPs by 9%, making the model more lightweight and easier to deploy and run on resource-limited devices such as inspection robots. Second, it improved precision by 4.6%, recall by 2.8%, and mAP@0.5 by 4%, markedly increasing the detection of small targets and key targets in complex scenes, which shows that the model detects key targets in dim environments with high accuracy and stability. The frame rate reached 92 FPS, meeting real-time detection requirements.
YOLO11 + HetConv uses the lightweight HetConv, which requires at least 40% fewer parameters than the original Conv. As shown in Table 5, introducing the HetConv module not only reduced the model's parameters and computation but also improved precision by 2.5%, strengthening the detection of critical targets.
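For illustration, the sketch below shows a commonly used PyTorch approximation of a HetConv layer, in which a grouped 3 × 3 convolution covers a 1/p fraction of the input channels per filter and a pointwise 1 × 1 convolution covers the rest. This is a generic sketch of the published HetConv idea, not the exact layer implemented in HSS-YOLO.

```python
import torch.nn as nn

class HetConv(nn.Module):
    """Approximate HetConv: per filter, 3x3 kernels act on a 1/p fraction of
    the input channels (grouped conv) and 1x1 kernels act on the remainder."""
    def __init__(self, in_channels, out_channels, p=4, stride=1):
        super().__init__()
        # in_channels and out_channels are assumed divisible by p.
        self.conv3x3 = nn.Conv2d(in_channels, out_channels, 3, stride=stride,
                                 padding=1, groups=p, bias=False)
        self.conv1x1 = nn.Conv2d(in_channels, out_channels, 1, stride=stride,
                                 bias=False)
        self.bn = nn.BatchNorm2d(out_channels)
        self.act = nn.SiLU()

    def forward(self, x):
        # Sum the heterogeneous branches, then normalise and activate.
        return self.act(self.bn(self.conv3x3(x) + self.conv1x1(x)))
```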
YOLO11 + ShuffleAttention adds a ShuffleAttention layer before the C2PSA module of the backbone network. This attention mechanism improves the deep convolutional network's detection of key targets by making the network attend to the important features in the image while suppressing irrelevant information.
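A minimal PyTorch sketch of such a ShuffleAttention block, following the published SA design [16] (grouped channel and spatial branches recombined by a channel shuffle), is given below; the number of groups and other parameter choices are illustrative rather than those used in HSS-YOLO.

```python
import torch
import torch.nn as nn

class ShuffleAttention(nn.Module):
    """Sketch of Shuffle Attention: channels are split into groups, each group
    is divided into a channel-attention branch and a spatial-attention branch,
    and the results are recombined with a channel shuffle."""
    def __init__(self, channels, groups=64):
        super().__init__()
        self.groups = groups
        c = channels // (2 * groups)          # channels per branch; channels must divide by 2*groups
        self.avg_pool = nn.AdaptiveAvgPool2d(1)
        self.cweight = nn.Parameter(torch.zeros(1, c, 1, 1))
        self.cbias = nn.Parameter(torch.ones(1, c, 1, 1))
        self.sweight = nn.Parameter(torch.zeros(1, c, 1, 1))
        self.sbias = nn.Parameter(torch.ones(1, c, 1, 1))
        self.gn = nn.GroupNorm(c, c)
        self.sigmoid = nn.Sigmoid()

    @staticmethod
    def channel_shuffle(x, groups):
        b, c, h, w = x.shape
        x = x.view(b, groups, c // groups, h, w).transpose(1, 2).contiguous()
        return x.view(b, c, h, w)

    def forward(self, x):
        b, c, h, w = x.shape
        x = x.view(b * self.groups, c // self.groups, h, w)
        x_ch, x_sp = x.chunk(2, dim=1)        # channel / spatial branches
        # Channel attention: global context produces a per-channel gate.
        gate_c = self.sigmoid(self.cweight * self.avg_pool(x_ch) + self.cbias)
        x_ch = x_ch * gate_c
        # Spatial attention: group-normalised features produce a per-location gate.
        gate_s = self.sigmoid(self.sweight * self.gn(x_sp) + self.sbias)
        x_sp = x_sp * gate_s
        out = torch.cat([x_ch, x_sp], dim=1).view(b, c, h, w)
        return self.channel_shuffle(out, 2)
```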
YOLO11 + Inner-SIoU improved precision by 3.8% and mAP@0.5 by 4.3%. It retains the basic properties of IoU while evaluating overlapping bounding boxes more accurately by computing the overlap between auxiliary inner boxes, giving the model finer evaluation capability and more accurate recognition and localization of targets in complex scenes.
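To illustrate the idea, the sketch below computes the inner IoU between auxiliary boxes obtained by rescaling the original boxes about their centres; in an Inner-SIoU loss this term takes the place of the plain IoU term of SIoU, whose angle, distance, and shape costs are omitted here for brevity. The ratio value is illustrative, not the one used in HSS-YOLO.

```python
import torch

def inner_iou(box1, box2, ratio=0.75, eps=1e-7):
    """IoU between auxiliary 'inner' boxes obtained by rescaling the originals
    about their centres (ratio < 1 shrinks, ratio > 1 enlarges).
    box1, box2: (N, 4) tensors in (x1, y1, x2, y2) format."""
    def rescale(b):
        cx, cy = (b[:, 0] + b[:, 2]) / 2, (b[:, 1] + b[:, 3]) / 2
        w, h = (b[:, 2] - b[:, 0]) * ratio, (b[:, 3] - b[:, 1]) * ratio
        return cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2

    ax1, ay1, ax2, ay2 = rescale(box1)
    bx1, by1, bx2, by2 = rescale(box2)
    # Intersection and union of the rescaled (inner) boxes.
    inter = (torch.min(ax2, bx2) - torch.max(ax1, bx1)).clamp(0) * \
            (torch.min(ay2, by2) - torch.max(ay1, by1)).clamp(0)
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter + eps
    return inter / union
```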
Finally, HSS-YOLO achieved the best results in this experiment, with improvements of 4.9% in precision, 2.8% in recall, and 4% in mAP@0.5, indicating higher overall accuracy.
Table 6 shows that HSS-YOLO maintains the same advantages in ablation experiments on the public dataset as on the custom dataset, improving its detection capability.
Overall, the experimental and data analyses of each module demonstrated that HSS-YOLO greatly strengthened detection ability for key targets. Subsequent experiments also proved that the model had strong recognition capability in dim environments of power distribution rooms, enhancing the model’s generalization and accuracy.
3.2.3. Comparison with Other Versions
This experiment employed a custom-built dataset to conduct a comparative analysis between HSS-YOLO and other versions of the YOLO architecture.
The final experimental results in Table 7 demonstrate that HSS-YOLO exhibited superior performance on this specific dataset compared to the alternative versions.
3.3. Experiments and Analysis
3.3.1. Experimental Environment
To ensure accuracy and authenticity, all the experiments were conducted under the same environment.
3.3.2. Comparison of Object Detection Inference Images
HSS-YOLO and YOLO11 were compared on object detection using test images from different scenarios. The specific experimental comparison images are shown below:
Figure 8 shows an operating scenario encountered by the inspection robot. Under dim lighting and interfering light sources, YOLO11 exhibited missed detections, whereas HSS-YOLO detected these targets and achieved higher confidence.
Figure 9 compares key target recognition in an outdoor dim area. The comparison between HSS-YOLO and YOLO11 shows that HSS-YOLO markedly improved the detection of small targets such as door handles, demonstrating a clear improvement in key target detection under dim conditions.
Figure 10 and Figure 11 display small target detection in brighter areas. The comparison clearly shows missed detections by YOLO11, which HSS-YOLO recovered with high-confidence predictions, representing a substantial improvement over YOLO11.
Figure 12 also shows a significant improvement in door handle recognition in the same indoor environment.
3.3.3. Heatmap Evaluation
Gradient-weighted Class Activation Mapping (Grad-CAM) visualizes heatmaps that help interpret the decisions of convolutional neural networks. The heatmaps can be upsampled and overlaid on the original images to highlight the regions the model focuses on most when making a prediction. An advantage of Grad-CAM is that it can be applied to any convolutional neural network without structural modifications or retraining, providing a simple yet intuitive way to understand the model's decision for a given input. In these visualizations, the deeper and more intense the red hue in a region, the stronger the model's attention to that region.
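A minimal hook-based sketch of Grad-CAM is shown below for a network whose output is a vector of class scores; applying it to a full detector requires selecting the score of a specific detection, which is omitted here, and the choice of target layer is an assumption.

```python
import torch

def grad_cam(model, layer, image):
    """Minimal Grad-CAM sketch: hook a convolutional layer, back-propagate the
    top score, and weight the activations by channel-averaged gradients."""
    acts, grads = [], []
    h1 = layer.register_forward_hook(lambda m, i, o: acts.append(o))
    h2 = layer.register_full_backward_hook(lambda m, gi, go: grads.append(go[0]))

    score = model(image).max()      # score of the strongest prediction (classifier-style output assumed)
    model.zero_grad()
    score.backward()
    h1.remove(); h2.remove()

    a, g = acts[0], grads[0]                          # (1, C, H, W)
    weights = g.mean(dim=(2, 3), keepdim=True)        # global-average-pooled gradients
    cam = torch.relu((weights * a).sum(dim=1))        # weighted sum over channels
    cam = (cam - cam.min()) / (cam.max() - cam.min() + 1e-7)
    return cam   # upsample and overlay on the input image for display
```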
Comparison of Grad-CAM heatmaps between YOLO11 and HSS-YOLO:
As introduced above, a deeper and more intense red hue indicates stronger attention to that region. Comparing the heatmaps in Figure 13, Figure 14 and Figure 15, HSS-YOLO produced deeper colors than YOLO11 on the key targets under the same experimental scenarios, indicating that HSS-YOLO attends to the key targets more strongly and detects them more accurately.
3.4. Model Evaluation Summary
Darker environments place higher demands on the accuracy and stability of the model. The experimental comparisons above, together with the analyses of object recognition images and heatmaps under the same scenarios, show that HSS-YOLO outperforms YOLO11 in both dim and other environments in terms of object recognition and heatmap quality. Compared with YOLO11, HSS-YOLO therefore demonstrates better accuracy and stability.