3.4.1. Comparative Study of Quantitative Experimental Data Under the TACO Dataset
The TACO dataset encompasses numerous waste image examples in various natural environment scenarios, such as forests, roads, and beaches. This characteristic ensures that the dataset is highly diverse and widely representative. Moreover, most of the target objects in this dataset are small-sized. In this study, by comparing the experimental result data of different algorithms on the TACO dataset, we aimed to verify the applicability of the proposed algorithm in complex scenarios and its detection effectiveness for small targets. The detailed comparison results of relevant data are shown in
Table 2.
This research carried out a comprehensive and systematic comparative analysis of mainstream object detection algorithm models in a unified experimental environment. According to the data shown in
Table 2, when the input image size is 300 × 300 pixels, compared with the two-stage algorithm Faster RCNN [
20], the algorithm proposed in this study achieves a significant improvement of 6.5 percentage points in mAP. Furthermore, compared with algorithms such as YOLOv5s, YOLOv7-tiny [
21], YOLOv9s [
22], DETR-R50 [
23], and YOLOv8n-GL [
24], the accuracy of the algorithm in this study is increased by 5.7, 10.5, 5.1, 2.8, and 7.7 percentage points, respectively. To obtain a lightweight algorithm, YOLOv8n, which has the lowest number of parameters, was selected as the base model and further optimized. With an input image size of 300 × 300 pixels, the optimized algorithm has only 2.2 M parameters, 0.6 M fewer than the base model. This reduction comes from the improvement of the C2f module in the original model: replacing the Bottleneck module with the MS module removes the parameter redundancy the Bottleneck introduces when processing large datasets. Compared with the DETR series of algorithms, the algorithm in this study shows significant advantages in both parameter count and floating-point operations.
Table 2.
Comparative experimental results of TACO dataset.
Method | Image Size | mAP (%) | Params (M) | GFLOPs | FPS
---|---|---|---|---|---
Faster RCNN | 1000 × 600 | 71.4 | 131.1 | 370.2 | 9.5
EfficientDet | 640 × 640 | 53.2 | 4.0 | 10.5 | 74.8
YOLOv5s | 500 × 500 | 72.2 | 7.4 | 16.1 | 42.2
YOLOv7-tiny | 500 × 500 | 67.4 | 6.1 | 13.1 | 43.5
YOLOv8n | 640 × 640 | 73.4 | 3.1 | 9.3 | 54.5
YOLOv8n | 300 × 300 | 72.8 | 2.8 | 8.2 | 72.4
YOLOv9-S | 640 × 640 | 68.3 | 15.6 | 26.4 | 70.6
YOLOv10-S [25] | 640 × 640 | 75.1 | 20.8 | 31.2 | 75.0
YOLOv11n | 640 × 640 | 77.2 | 2.4 | 7.2 | 82.6
DETR-R50 | 1200 × 800 | 65.3 | 41.5 | 136.2 | 98.1
DN-DETR-R50 | 1200 × 800 | 74.4 | 44.3 | 157.6 | 102.3
RT-DETR-l [26] | 1200 × 800 | 76.9 | 22.2 | 105.6 | 118.5
YOLOv8n-GL | 300 × 300 | 75.6 | 3.02 | 8.8 | 60.8
MS-YOLO | 300 × 300 | 77.9 | 2.2 | 6.4 | 95.3
In summary, the algorithm proposed in this research achieves a mAP of 77.9% with only 2.2 M parameters and 6.4 GFLOPs of computation. It not only improves small-target detection but also realizes a lightweight design. Moreover, at 95.3 FPS it meets real-time requirements in scenic-area scenarios and provides strong support for practical applications.
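The parameter saving from replacing the Bottleneck in C2f with a multi-scale block can be illustrated with simple parameter arithmetic. The sketch below does not reproduce the actual MS module; `ms_block_params` assumes a hypothetical depthwise multi-scale branch layout purely to show why such a substitution shrinks the model.

```python
def conv_params(c_in, c_out, k, groups=1):
    """Parameter count of a k x k convolution (bias ignored)."""
    return (k * k * c_in // groups) * c_out

def bottleneck_params(c):
    # A standard C2f Bottleneck: two 3x3 convolutions on c channels.
    return conv_params(c, c, 3) + conv_params(c, c, 3)

def ms_block_params(c, kernels=(1, 3, 5)):
    # Hypothetical multi-scale block (illustration only): each branch is a
    # depthwise convolution on c // len(kernels) channels, fused by a 1x1 conv.
    cg = c // len(kernels)
    branches = sum(conv_params(cg, cg, k, groups=cg) for k in kernels)
    fuse = conv_params(c, c, 1)
    return branches + fuse

c = 64
print(bottleneck_params(c))   # 73728
print(ms_block_params(c))     # 4831
```

Under these assumptions the multi-scale block needs roughly one-fifteenth of the Bottleneck's parameters at 64 channels, which is the kind of redundancy reduction the text attributes to the MS module.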
To present the performance advantages of the improved algorithm in this paper over other mainstream algorithms more intuitively,
Figure 8 shows the mAP curves from the comparative experiments with mainstream detection algorithms.
This paper adopted a step-by-step verification approach to confirm the effectiveness of each improvement, with dedicated experiments on the improved modules. The C2f-MS convolution module, the CEPN feature fusion module, and the hybrid QS-Dot-IoU loss function were introduced one at a time on the TACO dataset, and a series of ablation experiments was conducted. The results of the ablation experiments are shown in
Table 3.
The results of the ablation experiments indicate that, with YOLOv8n as the base model, the algorithm has a mAP of 72.8%, 2.8 M parameters, and 9.3 GFLOPs. After introducing the C2f-MS module in place of the C2f module, the mAP increases by 2.9 percentage points and the parameter count falls by 0.3 M. The C2f-MS module is more lightweight because it retains the details of multi-scale original features, improves recognition accuracy on multi-scale features and small targets, and reduces parameter redundancy. Replacing the original network with the CEPN feature fusion network raises the mAP by a further 2.7 percentage points while significantly reducing the parameter count and computing-resource consumption. Thanks to its simple structure, skip connections, and reduced redundancy, CEPN mitigates semantic feature loss, captures wide-area semantic features along with small-target details, compensates for the original model's weakness in small-target detection, and yields a lighter model. Replacing CIoU with the QS-Dot-IoU loss function increases the mAP by 1.5 percentage points, showing that the hybrid loss enhances small-target localization, captures more detailed features, and improves detection. With all improvements combined, the mAP rises from 72.8% to 77.9%, a total gain of 5.1 percentage points, the parameter count drops from 2.8 M to 2.2 M, and the GFLOPs fall from 9.3 to 6.4, confirming the effectiveness of the C2f-MS module, the CEPN feature fusion network, and the hybrid loss function.
Finally, a comparative experiment of loss functions was conducted and the results are summarized in
Table 4.
According to the comparative experimental results in
Table 4, under the premise of keeping the parameter count and GFLOPs of the benchmark model unchanged, the impact on detection performance was evaluated by replacing the original CIoU loss with GIoU, SIoU, DIoU, and QS-Dot-IoU in turn. Specifically, adopting the GIoU loss function increased detection accuracy by 1.2 percentage points over the original CIoU loss, and the SIoU loss function brought an increase of 0.4 percentage points. Adopting the DIoU loss function decreased detection accuracy by 1.3 percentage points relative to the original loss. Most significantly, the QS-Dot-IoU loss function proposed in this paper achieved an accuracy improvement of 1.9 percentage points after replacing the CIoU loss function in the benchmark model.
In summary, the QS-Dot-IoU loss function does not measure the difference between the predicted box and the ground-truth box from a general geometric perspective; instead, it focuses on that difference from the perspective of target shape features, which enhances sensitivity to target shape and improves small-target detection. At the same time, the loss inherits the advantages of QFL and can adjust the weight according to the localization quality factor q, so that the model attends to both classification quality and localization quality during training, thereby improving the accuracy of the detection results. The QS-Dot-IoU loss function is therefore an effective means of improving detection performance.
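Since the exact QS-Dot-IoU formulation is not reproduced in this section, the following sketch only illustrates the two ingredients described above: a shape-sensitive box term and a QFL-style quality weighting. The function names and the particular shape penalty are illustrative assumptions, not the authors' implementation.

```python
def iou(b1, b2):
    """IoU of two (x1, y1, x2, y2) boxes."""
    ix = max(0.0, min(b1[2], b2[2]) - max(b1[0], b2[0]))
    iy = max(0.0, min(b1[3], b2[3]) - max(b1[1], b2[1]))
    inter = ix * iy
    a1 = (b1[2] - b1[0]) * (b1[3] - b1[1])
    a2 = (b2[2] - b2[0]) * (b2[3] - b2[1])
    return inter / (a1 + a2 - inter + 1e-9)

def shape_term(b1, b2):
    """Illustrative shape penalty: width/height mismatch, normalised."""
    w1, h1 = b1[2] - b1[0], b1[3] - b1[1]
    w2, h2 = b2[2] - b2[0], b2[3] - b2[1]
    return abs(w1 - w2) / max(w1, w2) + abs(h1 - h2) / max(h1, h2)

def quality_weighted_loss(pred, gt, score, beta=2.0):
    """QFL-style weighting: the quality factor q is taken to be the IoU."""
    q = iou(pred, gt)
    reg = 1.0 - q + shape_term(pred, gt)   # shape-aware box regression term
    weight = abs(q - score) ** beta        # large when score and quality disagree
    return weight * reg
```

With this weighting, samples whose classification score is far from their localization quality dominate the loss, which is the "focus on hard samples" behaviour the text attributes to the QFL component.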
3.4.2. Comparative Study of Quantitative Experimental Data Under the VIA-Img Dataset
The VIA-Img dataset mainly consists of seven categories, with a total of 7963 images, including targets of different scales such as large, medium, and small. In this paper, the entire dataset was first verified. The experimental results and the corresponding change curve graphs are shown in
Table 5 and
Figure 9, respectively.
As can be seen from
Table 5, the mAP of the algorithm in this paper is 73.9%, which is superior to the other algorithms. Compared with Faster RCNN, YOLOv5s, YOLOv7-tiny, YOLOv8n, YOLOv9-S, YOLOv10-S, and YOLOv8n-GL, the detection accuracy is increased by 2.3 percentage points, 7.1 percentage points, 4.7 percentage points, 3.6 percentage points, 1.8 percentage points, and 2.4 percentage points, respectively. This demonstrates the proposed algorithm's excellent ability to detect multi-scale targets. Moreover, in terms of model size, the algorithm in this paper uses 1.2 M fewer parameters than the base model and 1.3 fewer GFLOPs, realizing a lightweight design.
In the general dataset MS COCO, small targets are defined as objects with a resolution of less than 32 × 32 pixels, medium targets are defined as objects with a resolution ranging from 32 × 32 pixels to 96 × 96 pixels, and large targets are defined as objects with a resolution exceeding 96 × 96 pixels. Based on these definitions, we distinguish between large, medium, and small targets in both the self-made garbage dataset and the VIA-Img dataset.
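The MS COCO size thresholds above can be encoded directly. This small helper (names are our own) classifies a box by its pixel area:

```python
def coco_size_class(w, h, small=32, large=96):
    """Classify a w x h box as small/medium/large by MS COCO area thresholds:
    area < 32*32 is small, 32*32..96*96 is medium, above 96*96 is large."""
    area = w * h
    if area < small * small:
        return "small"
    if area <= large * large:
        return "medium"
    return "large"

print(coco_size_class(20, 20))    # small
print(coco_size_class(50, 50))    # medium
print(coco_size_class(100, 100))  # large
```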
To test the effect on small target detection, the targets of different scales (large, medium, and small) in this dataset were verified. The experimental results are shown in
Table 6.
As shown in
Table 6, compared with YOLOv8n, the APS of the MS-YOLO algorithm proposed in this paper is increased by 6.4 percentage points in small-target detection, which demonstrates the algorithm's effectiveness for small targets. For medium and large targets, the APM and APL values increase by 3.7 percentage points and 1.5 percentage points, respectively, showing that MS-YOLO is also effective in multi-scale target detection.
3.4.3. Comparative Study of Quantitative Experimental Data Under the Self-Made Dataset
This article compares the performance of various algorithms on the self-constructed garbage dataset to test the generalization ability of the MS-YOLO algorithm. The results of the comparison of the experimental data are shown in
Table 7.
According to
Table 7, compared with Faster RCNN, YOLOv5s, YOLOv7-tiny, YOLOv8n, YOLOv9-S, YOLOv10-S, and YOLOv8n-GL, the mAP of the algorithm proposed in this study is increased by 8.2 percentage points, 4.4 percentage points, 1.5 percentage points, 1.9 percentage points, 0.3 percentage points, 2.8 percentage points, and 1.2 percentage points, respectively. The proposed algorithm achieves the best mAP on the self-constructed waste dataset, demonstrating stronger robustness and generalization ability.
To more intuitively present the performance advantages of the improved algorithm in this paper over other mainstream algorithms in the self-constructed waste dataset,
Figure 10 shows the mAP curves from the comparative experiments with mainstream detection algorithms.
To verify the advantages of using the SAK attention mechanism in the C2f-MS module, multiple attention mechanisms, such as CAM [
27], CBAM [
28], and SE [
29], were introduced at the same position as C2f-MS for comparative testing. The experimental results are summarized in
Table 8. According to the experimental data, compared with the base model, introducing the C2f-M-CBAM and C2f-M-SE modules reduces the mAP by 2.3 and 1.6 percentage points, respectively, whereas adding the C2f-M-CAM and C2f-M-SAK modules increases the mAP by 0.1 and 0.3 percentage points, respectively, confirming the effectiveness of the SAK attention mechanism in enhancing the C2f-MS module. The CBAM and SE modules fail to achieve the expected results, likely because of limitations in how they process information: the SE module ignores spatial position information, while the CBAM module has difficulty constructing long-range dependencies. In contrast, SAK fuses information through multiple branches, strengthens multi-resolution feature representation, processes features with a multi-scale sub-network whose outputs are cascaded, and captures the association between global and local information.
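The multi-branch fusion idea behind SAK can be illustrated, in schematic form, by a selective-kernel-style mechanism: a pooled descriptor from each kernel branch is turned into a softmax weight that blends the branch outputs. This is a generic sketch of that mechanism on 1-D feature rows, not the SAK module itself.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scalars."""
    m = max(xs)
    e = [math.exp(x - m) for x in xs]
    s = sum(e)
    return [v / s for v in e]

def select_kernels(branch_outputs):
    """Fuse outputs of parallel kernel branches (each a feature row of the
    same length) using attention weights from their global-average descriptors."""
    descriptors = [sum(b) / len(b) for b in branch_outputs]
    weights = softmax(descriptors)
    fused = [sum(w * b[i] for w, b in zip(weights, branch_outputs))
             for i in range(len(branch_outputs[0]))]
    return weights, fused

# Two hypothetical branches (e.g. 3x3 and 5x5 kernel outputs):
w, f = select_kernels([[1.0, 2.0], [3.0, 4.0]])
print(w)  # weights sum to 1; the stronger branch dominates the fusion
```

The adaptive receptive-field behaviour described in the text corresponds to these weights shifting toward whichever kernel branch responds most strongly to the input.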
Furthermore, a heatmap experiment was conducted on the self-constructed waste dataset, and the results are shown in
Figure 11.
Figure 11a presents the original image, while
Figure 11b displays the heatmap generated without adding an attention mechanism. It can be clearly observed from the figure that the algorithm’s positioning of the key regions is not precise and shows a certain degree of dispersion. In contrast,
Figure 11c is the heatmap generated after introducing the attention mechanism. This figure demonstrates that by incorporating the SAK attention mechanism, the C2f-MS module significantly enhances the algorithm’s focus on the target detection task and enables it to concentrate more precisely on the target regions to be detected. Thus, the effectiveness of the SAK attention mechanism in enhancing the module’s performance is verified.
3.4.4. Qualitative Comparative Study
To intuitively measure the efficacy of the improved algorithm on the task of domestic waste classification and detection, a qualitative evaluation experiment on the test set was conducted in this section, using the TACO dataset, the VIA-Img dataset, and the self-constructed waste dataset. The experimental results are presented intuitively through
Figure 12, which compares the detection effects of the traditional baseline model and the improved algorithm model proposed in this chapter.
By observing the detection results of YOLOv8n (shown in
Figure 12b), it is evident that the algorithm performs poorly on small targets across different scenarios. For the “Plastic Bottle” category in the TACO and VIA-Img datasets, the mAP values are 71% and 51%, respectively. On the self-constructed dataset, YOLOv8n misdetects the “BottleCup” category, because when extracting features from small-target images it is easily disturbed by cluttered backgrounds. From the detection results of the MS-YOLO algorithm (shown in
Figure 12c), it can be seen that, compared with the YOLOv8n algorithm, the mAP values on the TACO and VIA-Img datasets are increased by 15 percentage points and 34 percentage points, respectively. On the self-constructed dataset, the proposed algorithm not only correctly identifies the categories of the detected objects but also reaches a mAP of 85%. These results show that the proposed algorithm avoids the misdetections observed with the base model and achieves a detection effect superior to it.
In the performance evaluation of medium and large targets, this algorithm shows excellent performance. As shown in the fourth and fifth rows, in a complex and changing environment, the recognition accuracy of the original model is 82%, while the recognition accuracy of the improved algorithm in this paper is improved to 85%, a significant increase of 3 percentage points. In an occluded environment, compared with the original model, the recognition accuracy of each target of the improved algorithm proposed in this paper has been improved to varying degrees. In summary, the improved algorithm in this paper also shows good performance in a cluttered and occluded environment.
In the small target detection task, the MS-YOLO algorithm shows significant advantages. Compared with the base model YOLOv8n algorithm, its small target detection accuracy is substantially improved, and the false detection rate is effectively reduced. It can more accurately identify small targets in complex backgrounds. The main reason for the performance improvement is adopting a hybrid loss function strategy. This loss function focuses on considering the difference between the predicted box and the actual box from the perspective of target shape features, enhancing the detection efficiency for shape-sensitive and small targets. Moreover, it weights the category score according to the localization quality, enabling the detection model to focus on difficult samples to localize or classify during training, suppressing the interference of complex backgrounds and improving the accuracy of small target detection.
Simultaneously, the C2f-MS module adopts a partial-retention transmission strategy that preserves the details of the original features at different scales, significantly improving recognition accuracy for objects with multi-scale features. The SAK module, with its adaptive selection ability, selectively enhances the feature representation of key regions, improving the algorithm's localization accuracy for small-target information. In addition, the CEPN feature fusion module focuses on the extraction layer and then, through connection-layer propagation and layer skipping, captures semantic features over a wider area, strengthening the capture of small-target details and improving small-target detection.
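The top-down propagation with skip connections described for CEPN can be sketched schematically on 1-D "feature rows"; the real module operates on 2-D feature maps with learned convolutions, so this is only an illustration of the fusion pattern (upsample the coarser level, merge with the finer level, then re-inject the finer level's original features via a skip).

```python
def upsample2x(feat):
    """Nearest-neighbour 2x upsampling of a 1-D feature row."""
    return [v for v in feat for _ in range(2)]

def top_down_fuse(pyramid):
    """Fuse a coarse-to-fine pyramid top-down with per-level skip connections.
    `pyramid[0]` is the coarsest level; each finer level is twice as long."""
    out = [pyramid[0]]                       # coarsest level passes through
    for fine in pyramid[1:]:
        up = upsample2x(out[-1])             # propagate coarse semantics down
        merged = [u + f for u, f in zip(up, fine)]
        # skip connection: re-inject the level's original detail features
        out.append([m + f for m, f in zip(merged, fine)])
    return out

print(top_down_fuse([[1.0], [1.0, 1.0]]))  # [[1.0], [3.0, 3.0]]
```

The skip re-injection is what keeps fine-level (small-target) detail from being washed out by the downward-propagated semantics, which is the deficiency of the original model that the text says CEPN addresses.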
3.4.5. Comparative Study of Generalization Experiments
To verify the extensive applicability of the algorithm proposed in this paper in waste classification and detection tasks, we selected the open-source datasets HGI-30 [
30] and TrashNet [
31] for experimental comparison. We mainly compared the benchmark model and the improved algorithm based on the benchmark model. The experimental results are summarized in
Table 9 and
Table 10, respectively.
On the HGI-30 dataset, the mAP of the model proposed in this paper is 93.2%, higher than that of the other models, and its parameter count is only 2.2 M, reducing model complexity.
On the TrashNet dataset, the mAP of the proposed model is 95.2%, again higher than that of the other models, with the same 2.2 M parameters, achieving a significant reduction in model complexity.
The performance improvements on the two open-source datasets are mainly attributed to the three improvements proposed in this paper. Firstly, the MS module is designed. This module aims to reduce model complexity while ensuring performance by selectively applying convolution kernels of different sizes, making the model more lightweight and efficient. At the same time, the SAK attention mechanism is incorporated. Through this mechanism, convolution kernels of different sizes can be dynamically selected, and the receptive field size can be adaptively adjusted, thereby obtaining more scale information features of small target objects and improving the detection accuracy of small target objects. In addition, the CEPN convergent diffusion pyramid network module is proposed. The MDCR module and the diffusion mechanism are used to solve the problem of semantic feature loss and enhance the capture of detailed information on small targets. Finally, the hybrid loss QS-Dot-IoU function is used. This loss function abandons the geometric perspective and focuses on target shape features to measure the difference between the predicted box and the actual box, effectively improving the sensitivity to target shapes and facilitating small target detection. Meanwhile, inheriting the advantages of QFL, the weight is adjusted according to the localization quality factor q, prompting the model training to balance classification and localization quality and enhancing the accuracy of detection results.
In summary, the improved algorithm in this paper achieves the best performance on both the HGI-30 dataset and the TrashNet dataset, thus demonstrating the proposed algorithm’s better generalization ability.
3.4.6. Algorithm Deployment
To address the issue of waste in scenic areas and demonstrate the practicality of the algorithm in this paper, the detection algorithm proposed herein is deployed on a scenic area mobile robot. The robot collects and classifies waste in scenic areas, replacing manual operations and enhancing efficiency. The mobile robot mainly consists of components such as a deep learning camera, a robotic arm, a mobile chassis, a radar, a controller, and motors.
The experimental process is as follows: (1) Calibrate the focal length of the depth camera to obtain the internal parameters of the camera. Perform hand–eye calibration and depth camera calibration, and calculate the transformation matrices between the robotic arm base and the camera coordinate system and between different camera coordinate systems. (2) Utilize the laser radar mounted on the robot to construct a two-dimensional map of the experimental environment. Set fixed patrol points on the existing map, and based on the environmental map and the path calculated by motion planning, the robot can perform closed-loop autonomous obstacle avoidance movement and conduct environmental detection. (3) When the camera detects waste, the robot moves to a position near the target and stops. Employ the deep learning network model to perform waste classification detection and complete three-dimensional spatial positioning. (4) Calculate the position of the target in the robot coordinate system according to the hand–eye calibration results, derive the target angles of each joint, guide the robotic arm to move to the target pose, and simultaneously control the gripper to close, complete the grasping, and place the waste into the designated trash bin.
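Step (4) above hinges on mapping a camera-frame detection into the robot-base frame using the hand-eye calibration result. A minimal sketch, assuming a hypothetical calibration matrix with aligned axes (the actual transform comes from the calibration in step (1)):

```python
def mat_vec(T, p):
    """Apply a 4x4 homogeneous transform T to a 3-D point p."""
    x, y, z = p
    v = (x, y, z, 1.0)
    return tuple(sum(T[r][c] * v[c] for c in range(4)) for r in range(3))

# Hypothetical hand-eye calibration result: the camera frame sits 0.10 m in
# front of and 0.25 m above the robot-arm base, with axes aligned.
T_base_cam = [
    [1, 0, 0, 0.10],
    [0, 1, 0, 0.00],
    [0, 0, 1, 0.25],
    [0, 0, 0, 1.00],
]

# A waste item detected 0.5 m ahead of the camera, in base coordinates:
print(mat_vec(T_base_cam, (0.0, 0.0, 0.5)))  # (0.1, 0.0, 0.75)
```

In the real system this base-frame position would then feed the inverse-kinematics step that derives the joint angles for the grasp.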
The experiment placed the mobile robot in an outdoor park scenario and conducted a grasping experiment using the waste detection algorithm in this paper. The following is the process of object grasping, as shown in
Figure 13. The mobile robot with the deployed algorithm can accurately grasp waste objects: the detection and recognition success rate is above 87%, and the average pickup time is 5.3 s. The improved algorithm's real-time performance meets the operational requirements of waste pickup in scenic areas.