3.4.1. Comparative Study of Quantitative Experimental Data Under the TACO Dataset
The TACO dataset encompasses numerous waste image examples in various natural environment scenarios, such as forests, roads, and beaches. This characteristic ensures that the dataset is highly diverse and widely representative. Moreover, most of the target objects in this dataset are small-sized. In this study, by comparing the experimental result data of different algorithms on the TACO dataset, we aimed to verify the applicability of the proposed algorithm in complex scenarios and its detection effectiveness for small targets. The detailed comparison results of relevant data are shown in
Table 2.
This research carried out a comprehensive and systematic comparative analysis of mainstream object detection algorithm models in a unified experimental environment. According to the data shown in
Table 2, when the input image size is 300 × 300 pixels, compared with the two-stage algorithm Faster RCNN [
20], the algorithm proposed in this study achieves a significant improvement of 6.5 percentage points in mAP. Furthermore, compared with algorithms such as YOLOv5s, YOLOv7-tiny [
21], YOLOv9s [
22], DETR-R50 [
23], and YOLOv8n-GL [
24], the accuracy of the algorithm in this study is increased by 5.7, 10.5, 5.1, 2.8, and 7.7 percentage points, respectively. To obtain a lightweight algorithm, YOLOv8n, which has the lowest number of parameters, was selected as the base model and further optimized. With an input image size of 300 × 300 pixels, the optimized algorithm has only 2.2 M parameters, 0.6 M fewer than the base model. This reduction comes from the improvement of the C2f module in the original model: replacing the Bottleneck module with the MS module removes the parameter redundancy the Bottleneck introduces when processing large datasets. Compared with the DETR series of algorithms, the algorithm in this study shows significant advantages in both parameter count and floating-point operations.
Table 2.
Comparative experimental results of TACO dataset.
Method | Image Size | mAP (%) | Params (M) | GFLOPs | FPS
---|---|---|---|---|---
Faster RCNN | 1000 × 600 | 71.4 | 131.1 | 370.2 | 9.5
EfficientDet | 640 × 640 | 53.2 | 4.0 | 10.5 | 74.8
YOLOv5s | 500 × 500 | 72.2 | 7.4 | 16.1 | 42.2
YOLOv7-tiny | 500 × 500 | 67.4 | 6.1 | 13.1 | 43.5
YOLOv8n | 640 × 640 | 73.4 | 3.1 | 9.3 | 54.5
YOLOv8n | 300 × 300 | 72.8 | 2.8 | 8.2 | 72.4
YOLOv9-S | 640 × 640 | 68.3 | 15.6 | 26.4 | 70.6
YOLOv10-S [25] | 640 × 640 | 75.1 | 20.8 | 31.2 | 75.0
YOLOv11n | 640 × 640 | 77.2 | 2.4 | 7.2 | 82.6
DETR-R50 | 1200 × 800 | 65.3 | 41.5 | 136.2 | 98.1
DN-DETR-R50 | 1200 × 800 | 74.4 | 44.3 | 157.6 | 102.3
RT-DETR-l [26] | 1200 × 800 | 76.9 | 22.2 | 105.6 | 118.5
YOLOv8n-GL | 300 × 300 | 75.6 | 3.02 | 8.8 | 60.8
MS-YOLO | 300 × 300 | 77.9 | 2.2 | 6.4 | 95.3
In summary, the algorithm proposed in this research achieves a mAP of 77.9% with only 2.2 M parameters and 6.4 GFLOPs of computation. It not only improves small-target detection but also realizes a lightweight design. Moreover, at 95.3 FPS it meets real-time requirements in scenic-area scenarios and provides strong support for practical applications.
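The parameter saving from replacing the Bottleneck in C2f with a multi-scale block can be illustrated with simple parameter arithmetic. The sketch below does not reproduce the actual MS module; `ms_block_params` assumes a hypothetical depthwise multi-scale branch layout purely to show why such a substitution shrinks the model.

```python
def conv_params(c_in, c_out, k, groups=1):
    """Parameter count of a k x k convolution (bias ignored)."""
    return (k * k * c_in // groups) * c_out

def bottleneck_params(c):
    # A standard C2f Bottleneck: two 3x3 convolutions on c channels.
    return conv_params(c, c, 3) + conv_params(c, c, 3)

def ms_block_params(c, kernels=(1, 3, 5)):
    # Hypothetical multi-scale block (illustration only): each branch is a
    # depthwise convolution on c // len(kernels) channels, fused by a 1x1 conv.
    cg = c // len(kernels)
    branches = sum(conv_params(cg, cg, k, groups=cg) for k in kernels)
    fuse = conv_params(c, c, 1)
    return branches + fuse

c = 64
print(bottleneck_params(c))   # 73728
print(ms_block_params(c))     # 4831
```

Under these assumptions the multi-scale block needs roughly one-fifteenth of the Bottleneck's parameters at 64 channels, which is the kind of redundancy reduction the text attributes to the MS module.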
To present the performance advantages of the improved algorithm in this paper over other mainstream algorithms more intuitively,
Figure 8 shows the mAP curves from the comparative experiments with mainstream detection algorithms.
This paper adopted a step-by-step verification approach to confirm the effectiveness of each improvement, with dedicated experiments on the improved modules. The C2f-MS convolution module, the CEPN feature fusion module, and the hybrid QS-Dot-IoU loss function were introduced one at a time on the TACO dataset, and a series of ablation experiments was conducted. The results of the ablation experiments are shown in
Table 3.
The results of the ablation experiments indicate that, with YOLOv8n as the base model, the algorithm has a mAP of 72.8%, 2.8 M parameters, and 9.3 GFLOPs. After introducing the C2f-MS module in place of the C2f module, the mAP increases by 2.9 percentage points and the parameter count falls by 0.3 M. The C2f-MS module is more lightweight because it retains the details of multi-scale original features, improves recognition accuracy on multi-scale features and small targets, and reduces parameter redundancy. Replacing the original network with the CEPN feature fusion network raises the mAP by a further 2.7 percentage points while significantly reducing the parameter count and computing-resource consumption. Thanks to its simple structure, skip connections, and reduced redundancy, CEPN mitigates semantic feature loss, captures wide-area semantic features along with small-target details, compensates for the original model's weakness in small-target detection, and yields a lighter model. Replacing CIoU with the QS-Dot-IoU loss function increases the mAP by 1.5 percentage points, showing that the hybrid loss enhances small-target localization, captures more detailed features, and improves detection. With all improvements combined, the mAP rises from 72.8% to 77.9%, a total gain of 5.1 percentage points, the parameter count drops from 2.8 M to 2.2 M, and the GFLOPs fall from 9.3 to 6.4, confirming the effectiveness of the C2f-MS module, the CEPN feature fusion network, and the hybrid loss function.
Finally, a comparative experiment of loss functions was conducted and the results are summarized in
Table 4.
According to the comparative experimental results in
Table 4, under the premise of keeping the parameter count and GFLOPs of the benchmark model unchanged, the impact on detection performance was evaluated by replacing the original CIoU loss with GIoU, SIoU, DIoU, and QS-Dot-IoU in turn. Specifically, adopting the GIoU loss function increased detection accuracy by 1.2 percentage points over the original CIoU loss, and the SIoU loss function brought an increase of 0.4 percentage points. Adopting the DIoU loss function decreased detection accuracy by 1.3 percentage points relative to the original loss. Most significantly, the QS-Dot-IoU loss function proposed in this paper achieved an accuracy improvement of 1.9 percentage points after replacing the CIoU loss function in the benchmark model.
In summary, the QS-Dot-IoU loss function does not measure the difference between the predicted box and the ground-truth box from a general geometric perspective; instead, it focuses on that difference from the perspective of target shape features, which enhances sensitivity to target shape and improves small-target detection. At the same time, the loss inherits the advantages of QFL and can adjust the weight according to the localization quality factor q, so that the model attends to both classification quality and localization quality during training, thereby improving the accuracy of the detection results. The QS-Dot-IoU loss function is therefore an effective means of improving detection performance.
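Since the exact QS-Dot-IoU formulation is not reproduced in this section, the following sketch only illustrates the two ingredients described above: a shape-sensitive box term and a QFL-style quality weighting. The function names and the particular shape penalty are illustrative assumptions, not the authors' implementation.

```python
def iou(b1, b2):
    """IoU of two (x1, y1, x2, y2) boxes."""
    ix = max(0.0, min(b1[2], b2[2]) - max(b1[0], b2[0]))
    iy = max(0.0, min(b1[3], b2[3]) - max(b1[1], b2[1]))
    inter = ix * iy
    a1 = (b1[2] - b1[0]) * (b1[3] - b1[1])
    a2 = (b2[2] - b2[0]) * (b2[3] - b2[1])
    return inter / (a1 + a2 - inter + 1e-9)

def shape_term(b1, b2):
    """Illustrative shape penalty: width/height mismatch, normalised."""
    w1, h1 = b1[2] - b1[0], b1[3] - b1[1]
    w2, h2 = b2[2] - b2[0], b2[3] - b2[1]
    return abs(w1 - w2) / max(w1, w2) + abs(h1 - h2) / max(h1, h2)

def quality_weighted_loss(pred, gt, score, beta=2.0):
    """QFL-style weighting: the quality factor q is taken to be the IoU."""
    q = iou(pred, gt)
    reg = 1.0 - q + shape_term(pred, gt)   # shape-aware box regression term
    weight = abs(q - score) ** beta        # large when score and quality disagree
    return weight * reg
```

With this weighting, samples whose classification score is far from their localization quality dominate the loss, which is the "focus on hard samples" behaviour the text attributes to the QFL component.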
3.4.2. Comparative Study of Quantitative Experimental Data Under the VIA-Img Dataset
The VIA-Img dataset mainly consists of seven categories, with a total of 7963 images, including targets of different scales such as large, medium, and small. In this paper, the entire dataset was first verified. The experimental results and the corresponding change curve graphs are shown in
Table 5 and
Figure 9, respectively.
As can be seen from
Table 5, the mAP of the algorithm in this paper is 73.9%, which is superior to the other algorithms. Compared with Faster RCNN, YOLOv5s, YOLOv7-tiny, YOLOv8n, YOLOv9-S, YOLOv10-S, and YOLOv8n-GL, the detection accuracy is increased by 2.3 percentage points, 7.1 percentage points, 4.7 percentage points, 3.6 percentage points, 1.8 percentage points, and 2.4 percentage points, respectively. This demonstrates the proposed algorithm's excellent ability to detect multi-scale targets. Moreover, in terms of model size, the algorithm in this paper uses 1.2 M fewer parameters than the base model and 1.3 fewer GFLOPs, realizing a lightweight design.
In the general dataset MS COCO, small targets are defined as objects with a resolution of less than 32 × 32 pixels, medium targets are defined as objects with a resolution ranging from 32 × 32 pixels to 96 × 96 pixels, and large targets are defined as objects with a resolution exceeding 96 × 96 pixels. Based on these definitions, we distinguish between large, medium, and small targets in both the self-made garbage dataset and the VIA-Img dataset.
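The MS COCO size thresholds above can be encoded directly. This small helper (names are our own) classifies a box by its pixel area:

```python
def coco_size_class(w, h, small=32, large=96):
    """Classify a w x h box as small/medium/large by MS COCO area thresholds:
    area < 32*32 is small, 32*32..96*96 is medium, above 96*96 is large."""
    area = w * h
    if area < small * small:
        return "small"
    if area <= large * large:
        return "medium"
    return "large"

print(coco_size_class(20, 20))    # small
print(coco_size_class(50, 50))    # medium
print(coco_size_class(100, 100))  # large
```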
To test the effect on small target detection, the targets of different scales (large, medium, and small) in this dataset were verified. The experimental results are shown in
Table 6.
As shown in
Table 6, compared with YOLOv8n, the APS of the MS-YOLO algorithm proposed in this paper is increased by 6.4 percentage points in small-target detection, which demonstrates the algorithm's effectiveness for small targets. For medium and large targets, the APM and APL values increase by 3.7 percentage points and 1.5 percentage points, respectively, showing that MS-YOLO is also effective in multi-scale target detection.
3.4.3. Comparative Study of Quantitative Experimental Data Under the Self-Made Dataset
This article compares the performance of various algorithms on the self-constructed garbage dataset to test the generalization ability of the MS-YOLO algorithm. The results of the comparison of the experimental data are shown in
Table 7.
According to
Table 7, compared with Faster RCNN, YOLOv5s, YOLOv7-tiny, YOLOv8n, YOLOv9-S, YOLOv10-S, and YOLOv8n-GL, the mAP of the algorithm proposed in this study is increased by 8.2 percentage points, 4.4 percentage points, 1.5 percentage points, 1.9 percentage points, 0.3 percentage points, 2.8 percentage points, and 1.2 percentage points, respectively. The proposed algorithm achieves the best mAP on the self-constructed waste dataset, demonstrating stronger robustness and generalization ability.
To more intuitively present the performance advantages of the improved algorithm in this paper over other mainstream algorithms in the self-constructed waste dataset,
Figure 10 shows the mAP curves from the comparative experiments with mainstream detection algorithms.
To verify the advantages of using the SAK attention mechanism in the C2f-MS module, multiple attention mechanisms, such as CAM [
27], CBAM [
28], and SE [
29], were introduced at the same position as C2f-MS for comparative testing. The experimental results are summarized in
Table 8. According to the experimental data, compared with the base model, introducing the C2f-M-CBAM and C2f-M-SE modules reduces the mAP by 2.3 and 1.6 percentage points, respectively, whereas adding the C2f-M-CAM and C2f-M-SAK modules increases the mAP by 0.1 and 0.3 percentage points, respectively, confirming the effectiveness of the SAK attention mechanism in enhancing the C2f-MS module. The CBAM and SE modules fail to achieve the expected results, likely because of limitations in how they process information: the SE module ignores spatial position information, while the CBAM module has difficulty constructing long-range dependencies. In contrast, SAK fuses information through multiple branches, strengthens multi-resolution feature representation, processes features with a multi-scale sub-network whose outputs are cascaded, and captures the association between global and local information.
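The multi-branch fusion idea behind SAK can be illustrated, in schematic form, by a selective-kernel-style mechanism: a pooled descriptor from each kernel branch is turned into a softmax weight that blends the branch outputs. This is a generic sketch of that mechanism on 1-D feature rows, not the SAK module itself.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scalars."""
    m = max(xs)
    e = [math.exp(x - m) for x in xs]
    s = sum(e)
    return [v / s for v in e]

def select_kernels(branch_outputs):
    """Fuse outputs of parallel kernel branches (each a feature row of the
    same length) using attention weights from their global-average descriptors."""
    descriptors = [sum(b) / len(b) for b in branch_outputs]
    weights = softmax(descriptors)
    fused = [sum(w * b[i] for w, b in zip(weights, branch_outputs))
             for i in range(len(branch_outputs[0]))]
    return weights, fused

# Two hypothetical branches (e.g. 3x3 and 5x5 kernel outputs):
w, f = select_kernels([[1.0, 2.0], [3.0, 4.0]])
print(w)  # weights sum to 1; the stronger branch dominates the fusion
```

The adaptive receptive-field behaviour described in the text corresponds to these weights shifting toward whichever kernel branch responds most strongly to the input.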
Furthermore, a heatmap experiment was conducted on the self-constructed waste dataset, and the results are shown in
Figure 11.
Figure 11a presents the original image, while
Figure 11b displays the heatmap generated without adding an attention mechanism. It can be clearly observed from the figure that the algorithm’s positioning of the key regions is not precise and shows a certain degree of dispersion. In contrast,
Figure 11c is the heatmap generated after introducing the attention mechanism. This figure demonstrates that by incorporating the SAK attention mechanism, the C2f-MS module significantly enhances the algorithm’s focus on the target detection task and enables it to concentrate more precisely on the target regions to be detected. Thus, the effectiveness of the SAK attention mechanism in enhancing the module’s performance is verified.
3.4.4. Qualitative Comparative Study
To intuitively measure the efficacy of the improved algorithm on the task of domestic waste classification and detection, a qualitative evaluation experiment on the test set was conducted in this section, using the TACO dataset, the VIA-Img dataset, and the self-constructed waste dataset. The experimental results are presented intuitively through
Figure 12, which compares the detection effects of the traditional baseline model and the improved algorithm model proposed in this chapter.
By observing the detection results of YOLOv8n (shown in
Figure 12b), it is evident that the algorithm performs poorly on small targets across different scenarios. For the “Plastic Bottle” category in the TACO and VIA-Img datasets, the mAP values are 71% and 51%, respectively. On the self-constructed dataset, YOLOv8n misdetects the “BottleCup” category, because when extracting features from small-target images it is easily disturbed by cluttered backgrounds. From the detection results of the MS-YOLO algorithm (shown in
Figure 12c), it can be seen that, compared with the YOLOv8n algorithm, the mAP values on the TACO and VIA-Img datasets are increased by 15 percentage points and 34 percentage points, respectively. On the self-constructed dataset, the proposed algorithm not only correctly identifies the categories of the detected objects but also reaches a mAP of 85%. These results show that the proposed algorithm avoids the misdetections observed with the base model and achieves a detection effect superior to it.
In the performance evaluation of medium and large targets, this algorithm shows excellent performance. As shown in the fourth and fifth rows, in a complex and changing environment, the recognition accuracy of the original model is 82%, while the recognition accuracy of the improved algorithm in this paper is improved to 85%, a significant increase of 3 percentage points. In an occluded environment, compared with the original model, the recognition accuracy of each target of the improved algorithm proposed in this paper has been improved to varying degrees. In summary, the improved algorithm in this paper also shows good performance in a cluttered and occluded environment.
In the small target detection task, the MS-YOLO algorithm shows significant advantages. Compared with the base model YOLOv8n algorithm, its small target detection accuracy is substantially improved, and the false detection rate is effectively reduced. It can more accurately identify small targets in complex backgrounds. The main reason for the performance improvement is adopting a hybrid loss function strategy. This loss function focuses on considering the difference between the predicted box and the actual box from the perspective of target shape features, enhancing the detection efficiency for shape-sensitive and small targets. Moreover, it weights the category score according to the localization quality, enabling the detection model to focus on difficult samples to localize or classify during training, suppressing the interference of complex backgrounds and improving the accuracy of small target detection.
Simultaneously, the C2f-MS module adopts a partial-retention transmission strategy that preserves the details of the original features at different scales, significantly improving recognition accuracy for objects with multi-scale features. The SAK module, with its adaptive selection ability, selectively enhances the feature representation of key regions, improving the algorithm's localization accuracy for small-target information. In addition, the CEPN feature fusion module focuses on the extraction layer and then, through connection-layer propagation and layer skipping, captures semantic features over a wider area, strengthening the capture of small-target details and improving small-target detection.
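The top-down propagation with skip connections described for CEPN can be sketched schematically on 1-D "feature rows"; the real module operates on 2-D feature maps with learned convolutions, so this is only an illustration of the fusion pattern (upsample the coarser level, merge with the finer level, then re-inject the finer level's original features via a skip).

```python
def upsample2x(feat):
    """Nearest-neighbour 2x upsampling of a 1-D feature row."""
    return [v for v in feat for _ in range(2)]

def top_down_fuse(pyramid):
    """Fuse a coarse-to-fine pyramid top-down with per-level skip connections.
    `pyramid[0]` is the coarsest level; each finer level is twice as long."""
    out = [pyramid[0]]                       # coarsest level passes through
    for fine in pyramid[1:]:
        up = upsample2x(out[-1])             # propagate coarse semantics down
        merged = [u + f for u, f in zip(up, fine)]
        # skip connection: re-inject the level's original detail features
        out.append([m + f for m, f in zip(merged, fine)])
    return out

print(top_down_fuse([[1.0], [1.0, 1.0]]))  # [[1.0], [3.0, 3.0]]
```

The skip re-injection is what keeps fine-level (small-target) detail from being washed out by the downward-propagated semantics, which is the deficiency of the original model that the text says CEPN addresses.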
3.4.5. Comparative Study of Generalization Experiments
To verify the extensive applicability of the algorithm proposed in this paper in waste classification and detection tasks, we selected the open-source datasets HGI-30 [
30] and TrashNet [
31] for experimental comparison. We mainly compared the benchmark model and the improved algorithm based on the benchmark model. The experimental results are summarized in
Table 9 and
Table 10, respectively.
On the HGI-30 dataset, the mAP of the model proposed in this paper is 93.2%, higher than that of the other models, and its parameter count is only 2.2 M, reducing model complexity.
On the TrashNet dataset, the mAP of the proposed model is 95.2%, again higher than that of the other models, with the same 2.2 M parameters, achieving a significant reduction in model complexity.
The performance improvements on the two open-source datasets are mainly attributed to the three improvements proposed in this paper. Firstly, the MS module is designed. This module aims to reduce model complexity while ensuring performance by selectively applying convolution kernels of different sizes, making the model more lightweight and efficient. At the same time, the SAK attention mechanism is incorporated. Through this mechanism, convolution kernels of different sizes can be dynamically selected, and the receptive field size can be adaptively adjusted, thereby obtaining more scale information features of small target objects and improving the detection accuracy of small target objects. In addition, the CEPN convergent diffusion pyramid network module is proposed. The MDCR module and the diffusion mechanism are used to solve the problem of semantic feature loss and enhance the capture of detailed information on small targets. Finally, the hybrid loss QS-Dot-IoU function is used. This loss function abandons the geometric perspective and focuses on target shape features to measure the difference between the predicted box and the actual box, effectively improving the sensitivity to target shapes and facilitating small target detection. Meanwhile, inheriting the advantages of QFL, the weight is adjusted according to the localization quality factor q, prompting the model training to balance classification and localization quality and enhancing the accuracy of detection results.
In summary, the improved algorithm in this paper achieves the best performance on both the HGI-30 dataset and the TrashNet dataset, thus demonstrating the proposed algorithm’s better generalization ability.
3.4.6. Algorithm Deployment
To address the issue of waste in scenic areas and demonstrate the practicality of the algorithm in this paper, the detection algorithm proposed herein is deployed on a scenic area mobile robot. The robot collects and classifies waste in scenic areas, replacing manual operations and enhancing efficiency. The mobile robot mainly consists of components such as a deep learning camera, a robotic arm, a mobile chassis, a radar, a controller, and motors.
The experimental process is as follows: (1) Calibrate the focal length of the depth camera to obtain the internal parameters of the camera. Perform hand–eye calibration and depth camera calibration, and calculate the transformation matrices between the robotic arm base and the camera coordinate system and between different camera coordinate systems. (2) Utilize the laser radar mounted on the robot to construct a two-dimensional map of the experimental environment. Set fixed patrol points on the existing map, and based on the environmental map and the path calculated by motion planning, the robot can perform closed-loop autonomous obstacle avoidance movement and conduct environmental detection. (3) When the camera detects waste, the robot moves to a position near the target and stops. Employ the deep learning network model to perform waste classification detection and complete three-dimensional spatial positioning. (4) Calculate the position of the target in the robot coordinate system according to the hand–eye calibration results, derive the target angles of each joint, guide the robotic arm to move to the target pose, and simultaneously control the gripper to close, complete the grasping, and place the waste into the designated trash bin.
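Step (4) above hinges on mapping a camera-frame detection into the robot-base frame using the hand-eye calibration result. A minimal sketch, assuming a hypothetical calibration matrix with aligned axes (the actual transform comes from the calibration in step (1)):

```python
def mat_vec(T, p):
    """Apply a 4x4 homogeneous transform T to a 3-D point p."""
    x, y, z = p
    v = (x, y, z, 1.0)
    return tuple(sum(T[r][c] * v[c] for c in range(4)) for r in range(3))

# Hypothetical hand-eye calibration result: the camera frame sits 0.10 m in
# front of and 0.25 m above the robot-arm base, with axes aligned.
T_base_cam = [
    [1, 0, 0, 0.10],
    [0, 1, 0, 0.00],
    [0, 0, 1, 0.25],
    [0, 0, 0, 1.00],
]

# A waste item detected 0.5 m ahead of the camera, in base coordinates:
print(mat_vec(T_base_cam, (0.0, 0.0, 0.5)))  # (0.1, 0.0, 0.75)
```

In the real system this base-frame position would then feed the inverse-kinematics step that derives the joint angles for the grasp.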
The experiment placed the mobile robot in an outdoor park scenario and conducted a grasping experiment using the waste detection algorithm in this paper. The following is the process of object grasping, as shown in
Figure 13. The mobile robot with the deployed algorithm can accurately grasp waste objects: the detection and recognition success rate is above 87%, and the average pickup time is 5.3 s. The improved algorithm's real-time performance meets the operational requirements of waste pickup in scenic areas.