Article

Solid Waste Detection Using Enhanced YOLOv8 Lightweight Convolutional Neural Networks

1
School of Computer and Communications Engineering, Changsha University of Science and Technology, Changsha 410015, China
2
School of Physics and Electronic Science, Changsha University of Science and Technology, Changsha 410114, China
*
Author to whom correspondence should be addressed.
Mathematics 2024, 12(14), 2185; https://doi.org/10.3390/math12142185
Submission received: 19 May 2024 / Revised: 3 July 2024 / Accepted: 9 July 2024 / Published: 12 July 2024

Abstract

As urbanization accelerates, solid waste management has become one of the key issues in urban governance. Accurate and efficient waste sorting is a crucial step in enhancing waste processing efficiency, promoting resource recycling, and achieving sustainable development. However, there are still many challenges inherent in today’s garbage detection methods. These challenges include the high computational cost of detection, the complexity of the detection background, and the difficulty in accurately evaluating the spatial relationship between rectangular detection frames during the inspection process. Therefore, this study improves the latest YOLOv8s object detection model, introducing a garbage detection model that balances light weight and detection performance. Firstly, this study introduces a newly designed structure, the CG-HGNetV2 network, to optimize the backbone network of YOLOv8s. This novel framework leverages local features, surrounding context, and global context to enhance the accuracy of semantic segmentation. It efficiently extracts features through a hierarchical approach, significantly reducing the computational cost of the model. Additionally, this study introduces an innovative network called MSE-AKConv, which integrates an attention module into the network architecture. The irregular convolution operations facilitate efficient feature extraction, enhancing the ability to extract valid information from complex backgrounds. In addition, this study introduces a new method to replace CIoU (complete intersection over union). On the basis of calculating IoU (intersection over union), it also considers the outer boundary of the two rectangles. By calculating the minimum distance between the boundaries, this method handles cases where boundaries are close but not overlapping, offering a more detailed similarity assessment than that provided by traditional IoU. In this study, the model was trained and evaluated using a publicly available dataset. Specifically, the model has improved the precision (P), recall rate (R), and mAP@50 (mean average precision at 50) by 4.80%, 0.10%, and 1.30%, while reducing model parameters by 6.55% and computational demand by 0.03%. This study not only provides an efficient automated solution for waste detection, but also opens up new avenues for ecological environmental protection.

1. Introduction

With the acceleration of the global urbanization process, the management of urban solid waste has become an increasingly serious issue [1,2,3]. The imperative transformation of the solid waste management system from a production centric to an environmentally focused framework is an undeniable necessity [4]. A good waste management system is crucial, not only for the cleanliness of the urban environment and the health of its residents, but also for sustainable development and the recycling of resources. In this process, waste detection is undoubtedly a key step. It can significantly improve the efficiency and quality of waste processing, reduce dependence on landfills, promote the recycling of resources, and decrease environmental pollution. However, traditional methods of waste sorting often rely on manual sorting, which is not only inefficient and costly, but also poses health risks [5].
The generation of solid waste inevitably results from human activities, impacting both public health and the environment. Consequently, the field of waste management is garnering a heightened focus on intelligent and sustainable practices, particularly across both developed and developing nations. The inappropriate disposal of waste in non-designated areas presents significant challenges [6], prompting the use of various techniques to identify and sort waste [7,8]. However, the prevalent methods for waste classification largely rely on human expertise, which can be both challenging and laborious for precise waste categorization.
In the field of waste detection, challenges arise from the presence of a wide variety of waste types, as well as differing shapes, sizes, colors, and conditions. These factors significantly increase the complexity of automated classification tasks. Various types of waste also possess distinguishable characteristics; for instance, plastics typically exhibit a certain gloss and color, while metals often reflect gloss and have specific shapes. Early research regarding waste detection often focused on utilizing these characteristics for color and shape-based processing methods. However, these approaches can be influenced by changes in lighting conditions and the fading of waste surfaces, ultimately leading to a significant decrease in their performance.
The utilization of machine learning algorithms, such as artificial neural networks (ANNs) [9], has been extensively explored in the domain of waste identification. Several research efforts have been documented, each proposing various methodologies to enhance the efficiency and accuracy of garbage classification systems. For example, Yuan et al. [10] introduced MAPMobileNet-18, a streamlined residual network aimed at refining the process of waste identification, alongside assessing its accuracy, speed, and compatibility with edge devices. Ying et al. [11] introduced modifications to the YOLOv2 model by adjusting its parameters and implementing optimization and acceleration techniques. These modifications aimed to optimize the balance between the model’s real-time application capabilities and the accuracy of bounding box clustering, impacting its generalization in new or complex environments. Meanwhile, Ying et al. [12] developed a system for autonomous garbage detection, leveraging the open-source Faster R-CNN framework, opting for the ResNet network over VGG for the foundational convolutional layers. However, the ResNet model is relatively more complex, which may lead to an increased computational burden. Conversely, Fu et al. [13] presented an approach utilizing the MobileNetV3 framework; however, while its extensive network architecture contributed to decreased computational speed, it exhibited poor applicability on edge devices. Chen’s [14] work involved the integration of a MobileNet-v2 as a backbone for model distillation in waste detection, achieving a reduction in parameter count and an uptick in accuracy, albeit without addressing model generalizability. Feng’s [15] contribution involved a 23-layer CNN to elevate accuracy in waste detection, yet this complexity inadvertently affected the model’s real-time processing capabilities. Kang’s [16] research leveraged ResNet-34 for amalgamating multiple features from waste imagery, incorporating a novel activation function to enhance the detection of small-sized waste, although it encountered challenges in maintaining real-time detection efficiency due to computational demands. Gupta et al. [17] conducted a comparative analysis of the efficacy of various pre-trained neural networks for the task of garbage classification, employing supplementary hardware devices such as PiCam, Raspberry Pi, and infrared sensors. However, the goal of real-time garbage sorting remained unattained. Shi et al. [18] modified the Xception network to mitigate backpropagation issues, securing commendable classification performance at a high computational cost. Despite these innovations, a common oversight remains the high computational demands associated with ANNs, posing significant challenges for their integration into edge device hardware. Due to the diversity of scenarios in which garbage appears, the accuracy of the test results of the above algorithm model is affected, and the issue of high computational costs is overlooked. This makes it difficult to embed the garbage detection model into edge devices.
Real-time garbage detection presents a significant challenge. To achieve higher precision in garbage detection, it is crucial to continuously refine deep learning algorithms with the goal of achieving accurate multi-scale object detection without sacrificing speed. This will ensure that the network can adapt to variations in scale. In the contemporary era, marked by the swift advancement of deep learning technologies, the domain of object detection has witnessed significant progress. It has been widely applied in various complex scenarios [19]. The methodologies employed in object detection can broadly fall into two distinct categories. The first category encompasses two-stage algorithms, notably Fast Region-Based Convolutional Neural Networks [20], Region-Based Fully Convolutional Networks [21], and Mask Region-Based Convolutional Neural Networks [22]. These algorithms primarily focus on proposing candidate regions before performing the classification and bounding box regression assignments.
However, two-stage algorithms have some flaws, the first of which is that they are usually slow [20]. This is because these algorithms initially require the generation of region proposals [23], followed by classification and bounding box regression for each proposal. This two-stage processing makes the algorithm limited in real-time or fast detection scenarios. Additionally, there is a significant computational resource consumption [22]. Generating high-quality region proposals typically requires complex algorithms, and processing each proposal also necessitates the repeated execution of the same convolution operations, further increasing the computational burden.
On the other hand, the second category involves one-stage detection algorithms, which streamline the process by simultaneously conducting classification and bounding box regression in a single step. This approach, exemplified by the You Only Look Once (YOLO) [24,25,26,27,28,29,30] series, Single Shot MultiBox Detector [31], and RetinaNet [32], offers the advantage of increased processing speed. Single-stage object detection algorithms offer several significant advantages over two-stage algorithms, especially in terms of speed and simplified processes [31]. These advantages make single-stage algorithms particularly popular in scenarios requiring real-time processing and when computational resources are limited [24].
However, single-stage algorithms also have their limitations, such as potentially lower detection accuracy than two-stage algorithms in some cases. The gap in accuracy mainly stems from the tendency of single-stage algorithms to produce more false positives [32], as they predict multiple categories and bounding boxes at each location simultaneously. By improving the YOLOv8 model algorithm, it is possible to refine the classification of solid waste and effectively reduce the amount of garbage landfilled. This can increase the resource recycling utilization rate, thereby reducing environmental pollution and resource wastage. It also plays a role in promoting the protection of the ecological environment.
This paper improves the YOLOv8s object detection algorithm and tests it on the “Huawei Cloud” datasets, demonstrating that the proposed algorithm enhances detection efficiency. This document’s primary contributions include the following:
  • This study introduces a newly designed structure of CG-HGNetV2 based on HGNetV2 [33]. This new structure can utilize local features, surrounding context, and global context to improve the accuracy of semantic segmentation. It efficiently extracts features using a hierarchical approach, significantly reducing the computational cost of the model.
  • This study innovatively designs a network called the MSE-AKConv module, leveraging the AKConv [34] attention mechanism. The design of this module facilitates convolution kernels to adaptively accommodate an arbitrary parameter count and sampling geometries. Its primary function is to augment the focus on pivotal information pertinent to the task at hand while diminishing or eliminating the emphasis on extraneous data, thus enhancing detection capabilities. The irregular convolution operations facilitated efficient feature extraction, enhancing the ability to extract valid information from complex backgrounds.
  • This study introduces a new method to replace CIoU [35]. On the basis of calculating IoU, it also considers the outer boundary of the two rectangles. By calculating the minimum distance between the boundaries, this method handles cases where boundaries are close but not overlapping, offering a more detailed similarity assessment than traditional IoU.
  • The study’s experimental findings demonstrate the superior performance of our method when juxtaposed with several predominant models from the YOLO series.
The remainder of this document is organized as follows: An extensive review of literature relevant to our study is presented in Section 2. Following this, Section 3 is dedicated to a detailed exposition of the enhancements achieved in our methodology. In Section 4, this study undertakes a series of empirical studies to validate the efficacy of our refined model, with the findings detailed therein underscoring its enhanced performance. The paper concludes with a detailed summary in Section 5.

2. Related Work

2.1. YOLOv8 Introduction

YOLO (You Only Look Once) stands as a quintessential example of a one-stage detection algorithm, characterized by its unique capability to analyze an image a single time before generating results. In 2015, Joseph Redmon et al. initiated the framework with the introduction of YOLOv1 [26]. This framework employs a convolutional neural network (CNN) to process the entire image in a single forward pass, thereby identifying the boundaries and categories of the targets. This approach differs from two-stage detection algorithms, offering significant improvements in terms of processing speed. However, it tends to lag behind other methods in terms of detection precision and the capacity to generalize across various contexts. Subsequent iterations, including YOLOv2 [27], YOLOv3 [24], YOLOv4 [25], YOLOv5 [28], YOLOv7 [30], and YOLOv8 [36], have been developed, with each iteration aiming to refine and address the limitations of its predecessors.
YOLOv8 enhances its architecture with a CSPDarknet53 [24] backbone, reducing input features to five distinct scales through downsampling. Drawing inspiration from PANet [37], it integrates a PAN-FPN configuration in its design. This streamlines the model by omitting post-upsampling convolutions in the PAN structure, thereby preserving efficiency. Its detection mechanism is defined by a bifurcated head for separate classification and bounding box adjustments, utilizing distinct loss metrics for each task. As an anchor-free model, YOLOv8 precisely delineates sample categorizations and adopts a task-aligned assigner [38] for dynamic sample allocation, bolstering accuracy and stability. YOLOv8 is presented in five distinct scaled variants: YOLOv8n, YOLOv8s, YOLOv8m, YOLOv8l, and YOLOv8x. The emphasis of this study is on the enhancement of YOLOv8s for the improved detection of garbage targets. The two-dimensional architecture diagram of YOLOv8s is shown in Figure 1.
The YOLOv8s framework (see Figure 1) contains several key parts and their corresponding features: In part (a) of Figure 1, the backbone of YOLOv8 is depicted, which utilizes an optimized version of the CSPDarknet architecture. This backbone effectively balances computational efficiency with feature extraction capability. Part (b) of Figure 1 shows the neck of the network, which introduces depth-separable convolution. This significantly reduces the number of parameters and the computational complexity while maintaining strong feature extraction capabilities. Part (c) of Figure 1 represents the head of the network, incorporating attention mechanisms such as the convolutional block attention module (CBAM). This integration enhances the model’s ability to focus on important features. Additionally, YOLOv8 features an optimized design of skip connections, improving the fusion of features at different scales and aiding in the detection of objects of various sizes. The multi-scale feature extraction is also enhanced, boosting the detection ability for objects of different sizes. Overall, a lightweight design is employed to improve inference speed while maintaining high performance.

2.2. Test Innovation

This paper improves the YOLOv8s object detection algorithm and tests it on the “Huawei Cloud” datasets, demonstrating that the proposed algorithm enhances detection efficiency. The newly designed CG-HGNetV2 structure is based on HGNetV2 [33]. This structure can make use of local features, surrounding context, and global context to improve the accuracy of semantic segmentation. This method uses a hierarchical method to efficiently extract features, which significantly reduces the computational cost of the model.
The innovation of this paper lies in the following aspects:
(1) A network called the MSE-AKConv module is designed, which utilizes the AKConv [34] attention mechanism. The module is designed so that the convolution kernel can adapt to arbitrary parameter counting and sampling geometry. Its main function is to enhance the detection capabilities by increasing the focus on critical information relevant to the task at hand, while reducing or eliminating the focus on irrelevant data. Irregular convolution operations facilitate efficient feature extraction and enhance the ability to extract effective information from complex backgrounds.
(2) A new alternative to CIoU is introduced [35]. On the basis of calculating the IOU, the outer boundaries of the two rectangles are also considered. By calculating the minimum distance between boundaries, the method can handle cases where boundaries are close but not overlapping, providing a more detailed similarity assessment than traditional IoU.
In view of the challenge that the performance of YOLOv8s is highly dependent on the quality and diversity of the training data, the following methods are adopted in this new model to solve the problem. First, advanced data enhancement methods, such as Mosaic, MixUp, CutMix, etc., are employed. Applying techniques such as random erasure, color dithering, and geometric transformations significantly enhances data diversity and boosts the model’s generalization capability. Second, computer graphics technology is used to generate synthetic images, creating realistic composite data using generative adversarial networks (GANs). This can complement scenarios and targets that are missing from the actual dataset. Third, a model pre-trained using large datasets is employed. Fine-tuning the model to fit specific tasks can reduce reliance on large amounts of labeled data. Fourth, labeled and unlabeled data are combined for training, using pseudo-labeling techniques to increase the amount of valid training data. Fifth, advanced regularization methods, such as DropBlock, label smoothing, etc., are applied. These techniques can improve the generalization ability of the model and reduce overfitting. Sixth, a continuous learning strategy is implemented so that the model can continuously learn from new data. This helps the model adapt to changes in the data distribution. The comprehensive application of these methods can significantly reduce the dependence of YOLOv8s on high-quality and diversified training data and improve the generalization ability and robustness of the model. The following sections specify the applicability of these methods.
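As a minimal sketch of how two of these augmentations could be wired into a PyTorch training pipeline, the snippet below implements MixUp and random erasing by hand. The function names and the classification-style one-hot targets are illustrative assumptions; in detection training, the bounding box annotations would also need to be merged, which is omitted here for brevity.

```python
import torch

def mixup(images, targets, alpha=0.2):
    """Blend pairs of images and their soft labels with a Beta-distributed weight.

    images:  (B, C, H, W) float tensor
    targets: (B, num_classes) one-hot / soft label tensor
    """
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    perm = torch.randperm(images.size(0))
    mixed_images = lam * images + (1.0 - lam) * images[perm]
    mixed_targets = lam * targets + (1.0 - lam) * targets[perm]
    return mixed_images, mixed_targets

def random_erase(images, p=0.5, scale=(0.02, 0.2)):
    """Randomly zero out a rectangular patch in each image (random erasing)."""
    B, C, H, W = images.shape
    out = images.clone()
    for i in range(B):
        if torch.rand(1).item() > p:
            continue
        area = torch.empty(1).uniform_(*scale).item() * H * W
        side = int(area ** 0.5)
        y = torch.randint(0, max(H - side, 1), (1,)).item()
        x = torch.randint(0, max(W - side, 1), (1,)).item()
        out[i, :, y:y + side, x:x + side] = 0.0
    return out

# Example: a batch of 4 RGB images at 640x640 with 44 classes
imgs = torch.rand(4, 3, 640, 640)
labels = torch.eye(44)[torch.randint(0, 44, (4,))]
imgs, labels = mixup(imgs, labels)
imgs = random_erase(imgs)
```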

3. Methods

YOLOv8 was once regarded as the pinnacle of contemporary object detection models, but it is worth noting that the YOLO series is also evolving continuously, leading to significant advances in this field. This provides direction for our future in-depth research. However, the focus of this study remains on the discussion of YOLOv8. The original YOLOv8 model’s performance is limited in facing the unique challenges of waste classification detection tasks. This limitation is particularly evident when dealing with garbage images captured from various angles. These images often have rich and varied backgrounds and contain a large number of objects of different sizes, making object detection more difficult.
To address this issue, this study innovatively modified the architecture of YOLOv8 to adapt to the task of garbage detection. This study adopted CG-HGNetV2, an improved network structure based on the HGNetV2 [33] network, as the backbone network of YOLOv8. The new structure has been selected to replace the standard network utilized in the original model. This updated framework is capable of leveraging local features, surrounding context, and global context to enhance the accuracy of semantic segmentation. It efficiently extracts features through a hierarchical approach, leading to a significant reduction in the computational cost of the model. Furthermore, this study has integrated an attention module known as MSE-AKConv, which plays a key role in directing the network’s focus towards the essential components of the target. Through such improvements, the network can more accurately lock onto and locate the positions of large waste objects. Furthermore, this study introduces a new method to replace CIoU [35]. On the basis of calculating IoU, it also considers the outer boundary of the two rectangles. By calculating the minimum distance between the boundaries, this method handles cases where boundaries are close but not overlapping. It offers a more detailed similarity assessment than does traditional IoU.
This substitution means that our model can learn and adapt faster, and it has also significantly improved the precision of bounding box regression. These strategic adjustments and optimizations have enhanced the performance of the improved YOLOv8 model in the field of garbage detection, offering a more robust and efficient solution for this domain. The improved model is better, not only in terms of object detection accuracy, but also relative to model training efficiency. Through these structural adjustments and enhancements, this study advances the implementation of the YOLOv8 model in the field of waste classification. The three-dimensional structural diagram of this study is shown in Figure 2.

3.1. Lightweight Backbone Network CG-HGNetV2

Accurate and efficient detection is key to garbage detection. However, the YOLOv8 model’s feature extraction relies mainly on 3 × 3 convolution operations. This leads to an increase in model parameters and computational costs, making it unsuitable for rapid detection. Therefore, this paper adopts the network structure of CG-HGNetV2 for optimization. This design enables effective operation, even in resource-constrained environments, while maintaining high accuracy and real-time performance. The structural diagram of this architecture is shown in Figure 3.
Figure 3 shows the overall architecture of CG-HGNetV2. From the data in the figure, it can be determined that CG-HGNetV2 adopts a multi-stage design, mainly including: (a) the initial HG StemBlock, (b) multiple CG-HG stages, and (c) a context guide block (specifically integrated in HG Stage 1). In the StemBlock, the main function is preliminary feature extraction, which consists of four parts: convolutional layers, batch normalization (BN), activation functions, and pooling layers. The main role of StemBlock is to reduce the input resolution, decrease the subsequent computational load, increase the number of channels, enrich feature representation, and preliminarily extract low-level features (such as edges and textures).
The function of the CG-HG stage is mainly deep feature extraction and processing. Regarding the connections between the CG-HG stages, the output of each CG-HG stage serves as the input for the next stage, progressively extracting higher-level features. Additionally, the outputs of different HG stages may be used as multi-scale features, which is beneficial for detecting targets of different sizes. Integrating the context guide block in HG stage 1 can enhance the model’s understanding of the global context at an earlier phase, empowering subsequent stages to better handle scale variations and spatial relationships in object detection.
The CG-HGNetV2 network algorithm combines multiple advanced network design concepts, including Ghost convolution, CSP structure, and context guide modules. The detailed algorithm description of the CG-HGNetV2 network is as follows:
(1) The input image first undergoes preliminary feature extraction and downsampling through the StemBlock.
(2) The feature map sequentially passes through four HG stages, with each stage containing multiple HG blocks.
(3) After the first HG stage, the context guide block is applied to enhance context awareness.
(4) Each HG block uses Ghost convolution and CSP structure to improve efficiency and performance.
(5) The final feature map generates the final detection results through the detection head.
Through this carefully designed architecture, CG-HGNetV2 can effectively extract rich features from the input images while considering computational efficiency and the importance of context information. The StemBlock lays the foundation for feature extraction, multiple HG stages progressively extract deep features, and the introduction of the context guide block enhances the model’s perception of global information. This combination enables CG-HGNetV2 to achieve excellent performance regarding object detection tasks.
This concept was initially proposed in CGNet [39], with the fundamental principle being to mimic the human visual system’s reliance on contextual information for understanding scenes. The CG block is used to capture local features, surrounding context, and global context information. Therefore, the CG-HGNetV2 designed in this study aims to fully leverage local features, surrounding context, and global context. This facilitates the establishment of connections between local and global contexts in the new structure, enhancing the accuracy and stability of the model. Additionally, this design structure enhances the model’s generalization ability, improving its performance in more complex situations. The structure of the context guided block module is shown in Figure 4.
The primary concept of the architecture in this study is to employ a hierarchical approach for feature extraction. This enables the learning of complex patterns at different scales and levels of abstraction, thereby enhancing the network’s capacity to process intricate image data. This layered and efficient processing is particularly advantageous for demanding tasks such as image classification. Precise prediction is crucial in recognizing complex patterns and features at different scales. The HG-block plays a key role as well, being a core component of the network, designed to process data in a hierarchical manner. Each HG-block may handle different levels of data abstraction, allowing the network to learn from both low-level and high-level features. The structure diagram of the HG-block is shown in Figure 5.
Another major feature of the architecture in this study, CG-HGNetV2, is the adoption of lightweight convolution. LightConvBNAct employs a lightweight convolutional structure. It utilizes a two-step convolution process. Firstly, a 1 × 1 convolution is employed for reducing feature dimensions or expansion, without the use of an activation function. This step diminishes the parameter count, while maintaining the current spatial dimensions of the feature map. Subsequently, a group convolution is executed, which is responsible for extracting spatial features. Each output channel is processed by a corresponding input channel through a convolution kernel, achieving the effect of depthwise convolution. Using group convolution significantly reduces the computational load and model parameters.
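A minimal PyTorch sketch of this two-step lightweight convolution is given below. The class name LightConvBNAct and the exact normalization and activation placement are assumptions based on the description above, not the released implementation.

```python
import torch
import torch.nn as nn

class LightConvBNAct(nn.Module):
    """1x1 pointwise conv (no activation) followed by a grouped KxK conv.

    The 1x1 conv adjusts the channel count cheaply; the grouped conv
    (groups = out_channels, i.e. depthwise) extracts spatial features.
    """
    def __init__(self, in_channels, out_channels, kernel_size=3):
        super().__init__()
        self.pointwise = nn.Sequential(
            nn.Conv2d(in_channels, out_channels, kernel_size=1, bias=False),
            nn.BatchNorm2d(out_channels),            # no activation after the 1x1 step
        )
        self.grouped = nn.Sequential(
            nn.Conv2d(out_channels, out_channels, kernel_size,
                      padding=kernel_size // 2,
                      groups=out_channels, bias=False),   # depthwise spatial conv
            nn.BatchNorm2d(out_channels),
            nn.ReLU(inplace=True),
        )

    def forward(self, x):
        return self.grouped(self.pointwise(x))

x = torch.rand(1, 64, 80, 80)
print(LightConvBNAct(64, 128)(x).shape)   # torch.Size([1, 128, 80, 80])
```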
Assuming that the convolution kernel is square-shaped, K represents the dimensions of the convolution kernel, with both width and height equal to K. $C_{in}$ is the number of input feature map channels, and $C_{out}$ is the number of output feature map channels. $H_{out}$ and $W_{out}$ are the height and width of the output feature map, respectively, while $H_{in}$ and $W_{in}$ are the height and width of the input feature map, respectively. The computational complexity of standard convolution is given in Equation (4).
In the discrete case, two-dimensional convolution can be expressed as:
$(K * I)(i, j) = \sum_{m=0}^{P-1} \sum_{n=0}^{Q-1} I(i - m, j - n)\, K(m, n)$
The input matrix is I, and the kernel matrix is K, with shape P × Q, where P and Q are the width and height of the kernel, respectively. To obtain the element at position (i, j) in the output matrix O, the kernel matrix K is slid over the input matrix I. At each step, the corresponding elements of I and K are multiplied, and these products are then summed.
For a convolution layer, the output can be expressed as:
$y_l = f(W_l * x_l + b_l)$
$f(x) = \begin{cases} x, & x > 0 \\ \alpha(e^{x} - 1), & x \le 0 \end{cases} \quad (\alpha > 0)$
Here, $y_l$ is the output of the l-th layer, $x_l$ is the input, $W_l$ is the convolution kernel, $b_l$ is the bias, and $f$ is the activation function.
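To make Equations (1)–(3) concrete, the following illustrative sketch evaluates the discrete convolution sum and the piecewise activation directly with explicit loops. A "valid" output region and zero bias are assumed here; real layers use the vectorized operators described above.

```python
import numpy as np

def conv2d_single(I, K):
    """Direct evaluation of (K * I)(i, j) = sum_m sum_n I(i - m, j - n) K(m, n).

    Output positions are restricted to where the shifted kernel stays inside I
    ("valid" convolution), so the output has shape (H - P + 1, W - Q + 1).
    """
    P, Q = K.shape
    H, W = I.shape
    out = np.zeros((H - P + 1, W - Q + 1))
    for i in range(P - 1, H):
        for j in range(Q - 1, W):
            s = 0.0
            for m in range(P):
                for n in range(Q):
                    s += I[i - m, j - n] * K[m, n]
            out[i - P + 1, j - Q + 1] = s
    return out

def elu(x, alpha=1.0):
    """Piecewise activation of Equation (3): x for x > 0, alpha*(e^x - 1) otherwise."""
    return np.where(x > 0, x, alpha * (np.exp(x) - 1.0))

I = np.random.rand(6, 6)
K = np.random.rand(3, 3)
y = elu(conv2d_single(I, K))   # Equation (2) with W_l = K and b_l = 0
print(y.shape)                 # (4, 4)
```

Equation (4) then gives the multiply-accumulate count of a standard convolution layer: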
$\text{Calculation amount} = K \times K \times C_{in} \times H_{out} \times W_{out} \times C_{out}$
The computational expense of the lightweight convolution is divided into two parts, with the computation amount for the 1 × 1 convolution as shown in Equation (5):
$\text{Calculation amount} = C_{in} \times H_{in} \times W_{in} \times C_{out}$
The computational cost of the group convolution is as shown in Equation (6):
$\text{Calculation amount} = C_{out} \times H_{out} \times W_{out} \times K \times K$
The integration of lightweight convolution with the context guided block (CGB) optimizes the network by reducing parameters through 1 × 1 and group convolutions, leveraging CGB for enhanced feature expressiveness. The 1 × 1 convolution efficiently merges and reduces feature channels, thereby decreasing computational complexity. Additionally, group convolution further reduces computational costs by independently processing the feature map groups. This combined approach not only lowers computational expenses, but also reduces model parameters by fusing local and global features via CGB. Consequently, this design maintains network expressiveness, enabling adequate performance in resource-constrained environments.
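The savings can be checked numerically by substituting representative layer shapes into Equations (4)–(6); the shapes below are illustrative examples, not measurements from the trained model.

```python
def standard_conv_macs(K, C_in, C_out, H_out, W_out):
    # Equation (4): K * K * C_in * H_out * W_out * C_out
    return K * K * C_in * H_out * W_out * C_out

def light_conv_macs(K, C_in, C_out, H_in, W_in, H_out, W_out):
    # Equation (5): 1x1 pointwise conv, plus Equation (6): grouped/depthwise conv
    pointwise = C_in * H_in * W_in * C_out
    grouped = C_out * H_out * W_out * K * K
    return pointwise + grouped

# Example layer: 3x3 kernel, 128 -> 256 channels, 80x80 feature map, stride 1
std = standard_conv_macs(3, 128, 256, 80, 80)
light = light_conv_macs(3, 128, 256, 80, 80, 80, 80)
print(f"standard: {std/1e9:.2f}G MACs, lightweight: {light/1e9:.2f}G MACs "
      f"({light/std:.1%} of standard)")
```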
In view of the limitations of YOLOv8s in the detection of small objects, we mainly solve this problem using the following techniques:
(1) Adding high-resolution feature maps, i.e., adding more upsampling layers to the network structure to generate higher-resolution feature maps. This provides additional fine-grained spatial information, which is conducive to the detection of small objects.
(2) Introducing attention mechanisms, i.e., spatial attention and channel attention mechanisms, such as an SE (squeeze-and-excitation) module, are introduced into the algorithm. This helps the model better focus on the features of small objects.
(3) Employing data enhancement, i.e., using augmentation techniques such as random cropping and amplification, and considering the use of Mosaic, MixUp, and other advanced data enhancement methods for small objects.
(4) Improving the loss function. By modifying the loss function, the penalty for small object detection errors is increased.
(5) Introducing auxiliary tasks, i.e., adding auxiliary tasks such as edge detection or semantic segmentation. This can help the model learn more detailed features, which is conducive to small object detection.
(6) Incorporating cascade detection to achieve a two-stage detection strategy; the second stage focuses on the fine detection of small objects.
(7) Utilizing post-processing optimization to improve the non-maximum suppression (NMS) algorithm, e.g., by using Soft-NMS or DIoU-NMS. This helps reduce the chance of small objects being deleted by mistake (a minimal Soft-NMS sketch follows this list).
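As referenced in item (7), the following is a minimal sketch of Gaussian Soft-NMS; the decay schedule and thresholds are illustrative defaults, not the settings used in this paper.

```python
import torch

def box_iou(a, b):
    """IoU between every box in a (M, 4) and b (N, 4), boxes in (x1, y1, x2, y2)."""
    area_a = (a[:, 2] - a[:, 0]) * (a[:, 3] - a[:, 1])
    area_b = (b[:, 2] - b[:, 0]) * (b[:, 3] - b[:, 1])
    lt = torch.max(a[:, None, :2], b[None, :, :2])
    rb = torch.min(a[:, None, 2:], b[None, :, 2:])
    wh = (rb - lt).clamp(min=0)
    inter = wh[..., 0] * wh[..., 1]
    return inter / (area_a[:, None] + area_b[None, :] - inter)

def soft_nms(boxes, scores, sigma=0.5, score_thresh=0.001):
    """Gaussian Soft-NMS: decay the scores of overlapping boxes instead of deleting them.

    Returns the indices of the surviving boxes, ordered by decayed score.
    """
    boxes, scores = boxes.clone().float(), scores.clone().float()
    keep, idxs = [], torch.arange(boxes.size(0))
    while idxs.numel() > 0:
        top = torch.argmax(scores[idxs]).item()
        best = idxs[top].item()
        keep.append(best)
        idxs = torch.cat([idxs[:top], idxs[top + 1:]])
        if idxs.numel() == 0:
            break
        ious = box_iou(boxes[best].unsqueeze(0), boxes[idxs]).squeeze(0)
        scores[idxs] *= torch.exp(-(ious ** 2) / sigma)   # Gaussian score penalty
        idxs = idxs[scores[idxs] > score_thresh]
    return keep

boxes = torch.tensor([[0., 0., 10., 10.], [1., 1., 11., 11.], [50., 50., 60., 60.]])
scores = torch.tensor([0.9, 0.8, 0.7])
print(soft_nms(boxes, scores))   # the heavily overlapping box is down-weighted, not dropped
```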

3.2. Effective Attention Mechanism

Addressing the garbage detection challenge in the “Huawei Cloud” datasets involves overcoming significant obstacles due to the heterogeneity of waste object sizes and intricate distribution patterns. Conventional convolution operations are inherently flawed in two respects. Firstly, these operations are constrained to local receptive fields, failing to assimilate information from distant areas, with a rigid sampling architecture. Secondly, the invariable sampling configurations and square kernel shapes exhibit limited adaptability to dynamic target variations. To surmount these limitations, our approach incorporates an adaptive kernel convolution mechanism, dubbed AKConv [34], into the architectural framework. AKConv utilizes a novel algorithm to determine the initial coordinates of the convolution kernels, irrespective of their dimensions. It introduces variable offsets to dynamically modify the sampling contours in response to target alterations. This innovation markedly diminishes both the computational demands and the storage prerequisites of the model, concurrently elevating the precision of garbage entity detection.
This study presents the incorporation of the Mish [40] activation function as a substitute for the SiLU activation function within AKConv. Additionally, after the convolution (conv) sequence operations, an SEnet [41] module is added. This allows the SEnet module to recalibrate the channel importance of the feature maps output by the convolution layer before producing the final result. This method ensures that the module can fully utilize the advanced features provided by the convolution layer. The new module is named MSE-AKConv. Figure 6 illustrates the workflow diagram of MSE-AKConv.
The input image exhibits the following dimensions (C, H, W), where C represents the number of channels, and H and W represent the vertical and horizontal extents, respectively. AKConv uniquely provides the initial sampling shape for the convolution kernel. Following the Conv2d operation on the input image, the sampling shape is adjusted using learned displacements. The resulting feature map undergoes resampling, reshaping, re-convolution, and normalization before being output through the Mish activation mechanism and undergoing further processing by the SEnet module.
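The following simplified sketch reproduces this data flow in PyTorch. It is not the full AKConv operator: the arbitrary-N kernel sampling is collapsed here into a single learned offset per spatial location resolved with F.grid_sample, and the class names are illustrative. Only the overall pipeline of offset prediction, bilinear resampling, convolution with normalization, Mish activation, and SE recalibration is retained.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SEBlock(nn.Module):
    """Squeeze-and-excitation channel reweighting (sigmoid gating over pooled channels)."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels), nn.Sigmoid())

    def forward(self, x):
        w = self.fc(x.mean(dim=(2, 3)))      # squeeze: global average pooling
        return x * w[:, :, None, None]       # excite: per-channel scaling

class MSEAKConvSketch(nn.Module):
    """Simplified MSE-AKConv flow: learned offsets -> bilinear resampling ->
    conv + BN -> Mish -> SE, i.e. X_tilde = SE(Mish(Conv(X_offset)))."""
    def __init__(self, in_channels, out_channels):
        super().__init__()
        self.offset = nn.Conv2d(in_channels, 2, kernel_size=3, padding=1)
        self.conv = nn.Conv2d(in_channels, out_channels, kernel_size=3, padding=1)
        self.bn = nn.BatchNorm2d(out_channels)
        self.act = nn.Mish()
        self.se = SEBlock(out_channels)

    def forward(self, x):
        b, _, h, w = x.shape
        # Base sampling grid in [-1, 1], shifted by the learned per-pixel offsets.
        ys, xs = torch.meshgrid(torch.linspace(-1, 1, h), torch.linspace(-1, 1, w),
                                indexing="ij")
        grid = torch.stack((xs, ys), dim=-1).expand(b, h, w, 2).to(x)
        grid = grid + self.offset(x).permute(0, 2, 3, 1) * 0.1
        x_offset = F.grid_sample(x, grid, align_corners=True)   # bilinear resampling
        return self.se(self.act(self.bn(self.conv(x_offset))))

print(MSEAKConvSketch(64, 128)(torch.rand(2, 64, 40, 40)).shape)  # (2, 128, 40, 40)
```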
Next, this research provides a detailed explanation of the derivation process of the attention mechanism. First, in this work, the function g(X) is defined (Equation (7)).
$g(X) = W_{n+1}\,\mathrm{ReLU}(W_n X + b_n) + b_{n+1}$
where $W_n$ and $W_{n+1}$ are weight matrices, and $b_n$ and $b_{n+1}$ are bias terms. These parameters are updated by gradient descent, as shown in Equations (8)–(13):
$W_1 = W_1 - \mathrm{learning\_rate} \cdot \partial L / \partial W_1$
$b_1 = b_1 - \mathrm{learning\_rate} \cdot \partial L / \partial b_1$
$W_x = W_x - \mathrm{learning\_rate} \cdot \partial L / \partial W_x \quad (x > 1)$
$b_x = b_x - \mathrm{learning\_rate} \cdot \partial L / \partial b_x \quad (x > 1)$
$W_n = W_n - \mathrm{learning\_rate} \cdot \partial L / \partial W_n$
$b_n = b_n - \mathrm{learning\_rate} \cdot \partial L / \partial b_n$
$L$ represents the loss function; $\partial L / \partial b_n$ indicates the influence of changes in the bias on the loss, and $\partial L / \partial W_n$ represents the influence of changes in the weights on the overall loss, i.e., the partial derivative of the loss function $L$ with respect to the weight matrix $W_n$ at the n-th layer. The above steps are repeated several times until the model converges. Through this process, $W_n$, $W_{n+1}$, $b_n$, and $b_{n+1}$ gradually adjust to their optimal values, enabling the attention mechanism to effectively focus on the important features.
The softmax is then calculated, as shown in Equation (14).
$\alpha(X) = \mathrm{softmax}(g(X)) = \dfrac{\exp(g(X))}{\sum \exp(g(X))}$
Finally, the attention weights are applied to the original features, as shown in Equation (15).
$X' = \alpha(X) \odot X$
X represents the input features, typically a multi-dimensional tensor with the shape (batch_size, channels, height, width). g(X) denotes a function, usually a small neural network, used to calculate the attention scores. $\alpha(X)$ represents the computed attention weights. X′ denotes the weighted output features. ⊙ represents element-wise multiplication (the Hadamard product).
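One way to realize Equations (7), (14), and (15) is as a channel-attention layer; applying them to a globally pooled channel descriptor is an assumption made here so that the sketch stays small and runnable.

```python
import torch
import torch.nn as nn

class SoftmaxChannelAttention(nn.Module):
    """Equations (7), (14), (15): g(X) = W_{n+1} ReLU(W_n X + b_n) + b_{n+1},
    alpha(X) = softmax(g(X)), X' = alpha(X) ⊙ X, applied here over the channel axis."""
    def __init__(self, channels, hidden=64):
        super().__init__()
        self.w_n = nn.Linear(channels, hidden)        # W_n, b_n
        self.w_n1 = nn.Linear(hidden, channels)       # W_{n+1}, b_{n+1}

    def forward(self, x):
        pooled = x.mean(dim=(2, 3))                   # (B, C) channel descriptor
        g = self.w_n1(torch.relu(self.w_n(pooled)))   # Equation (7)
        alpha = torch.softmax(g, dim=1)               # Equation (14)
        return x * alpha[:, :, None, None]            # Equation (15), Hadamard product

x = torch.rand(2, 128, 40, 40)
print(SoftmaxChannelAttention(128)(x).shape)          # torch.Size([2, 128, 40, 40])
```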
Compared to standard convolutions, MSE-AKConv offers more options, with convolution parameters increasing linearly with kernel size. It uniquely reduces model parameters and computational costs. Traditional convolution operations allow the parameters to grow quadratically with the kernel size. The kernel typically has dimensions of K × K, where K represents both height and width. The total parameters within a layer, designated as $P_{traditional}$, can be determined using Equation (16).
$P_{traditional} = K^2 \times C_{in} \times C_{out}$
If considering the bias term, with one bias per output channel, then the total number of parameters should be increased by the number of output channels $C_{out}$, as shown in Equation (17):
$P_{traditional} = (K^2 \times C_{in} + 1) \times C_{out}$
MSE-AKConv allows for the flexible linear adjustment of convolution kernel parameters, meeting specific needs and effectively managing model complexity and computational demands. The number of sampling points (N) can be adjusted based on various factors, such as task complexity or computational efficiency optimization. The calculation for the parameter count ($P_{linear}$) is shown in Equation (18).
$P_{linear} = N \times C_{in} \times C_{out}$
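The contrast between Equations (16)–(18) can be illustrated with a short calculation. The channel counts below are arbitrary examples; note that comparing N = K sampling points with a K × K kernel trades receptive-field coverage for parameters.

```python
def params_traditional(K, C_in, C_out, bias=True):
    # Equation (16)/(17): quadratic in the kernel size K
    return (K * K * C_in + (1 if bias else 0)) * C_out

def params_linear(N, C_in, C_out):
    # Equation (18): N sampling points chosen freely, linear in N
    return N * C_in * C_out

# Example: 128 -> 256 channels; an AKConv-style kernel with N = K points
# needs K*C_in*C_out weights versus K*K*C_in*C_out for a full K x K kernel.
for K in (3, 5, 7):
    print(K, params_traditional(K, 128, 256), params_linear(K, 128, 256))
```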
This study introduces the Mish activation function into the AKConv framework. This modification is driven by the superior gradient behavior and smoothness of the Mish function. It significantly enhances the capability of the model to identify waste targets. Such an enhancement mitigates the risks associated with gradient disappearance or excessive accumulation, thereby elevating the precision and robustness of the model. Furthermore, the adoption of the Mish activation function bolsters the generalization capacity of the model, rendering it more efficient in tackling complex environments.
After the convolution sequence operations, an SEnet module is added. In this way, the SE module can further enhance the model’s ability to capture the dynamic relationship between channels. This is based on the advanced feature representation of AKConv, thus optimizing the model performance. This integrated method utilizes spatial attention to dynamically adjust the position of the convolution kernels. Additionally, it strengthens the model’s ability to process channel dimension information through the SE module. The structure of the SEnet module is illustrated in Figure 7.
The feature map formula containing the MSE-AKConv attention mechanism is shown in Equation (20). $X$ represents the initial input feature map. $X_{offset}$ represents the adjusted feature map obtained through positional offset and bilinear interpolation. $\mathrm{Conv}(X_{offset})$ denotes the feature rearrangement and convolution operation on the adjusted feature map. $\mathrm{Mish}$ indicates the application of the Mish activation function to the convolved feature map. $\mathrm{SE}$ represents the application of the SEnet module. $\tilde{X}$ is the final output feature map. The formula for Mish is shown in Equation (19).
$\mathrm{Mish}(x) = x \cdot \tanh(\ln(1 + e^{x}))$
$\tilde{X} = \mathrm{SE}(\mathrm{Mish}(\mathrm{Conv}(X_{offset})))$
Under this design, the ability to linearly adjust the number of parameters allows for more flexible control over the number of parameters, which can be adjusted according to actual needs. This ensures model performance while optimizing the model’s computational efficiency and resource usage. In contrast, the number of parameters in traditional convolution operations grows quadratically with the convolution kernel size. This means that as the size of the convolution kernel increases, the number of parameters quickly grows, leading to increased complexity and computational cost of the model. MSE-AKConv optimizes the performance by removing unnecessary parameters, thereby reducing the demand for storage and computational resources. This not only accelerates the model’s training and inference speed, but also decreases energy consumption.

3.3. The Improved Loss Function MPDIoU

In the foundational YOLOv8 algorithm, the bounding box regression employs the CIoU loss function, which, despite its utility, exhibits several limitations. To begin with, the CIoU loss lacks mechanisms to equitably address the disparities between challenging and simpler samples, a critical aspect for enhancing model robustness. Additionally, it incorporates the aspect ratio as a penalizing component within its formulation. This approach, however, falls short in accurately capturing discrepancies between the predicted and actual bounding boxes that share identical aspect ratios, yet diverge in their width and height dimensions. Moreover, the CIoU calculation intricately involves inverse trigonometric operations, thereby escalating the computational demand on the model’s arithmetic resources. The formula for CIoU is shown in Equations (21)–(23):
$L_{CIoU} = 1 - IoU + R_{CIoU}$
$R_{CIoU} = \dfrac{\rho^2(b, b^{gt})}{c_w^2 + c_h^2} + \dfrac{4}{\pi^2}\left(\tan^{-1}\dfrac{w^{gt}}{h^{gt}} - \tan^{-1}\dfrac{w}{h}\right)^2$
$L_{CIoU} = 1 - IoU + \dfrac{\rho^2(b, b^{gt})}{c_w^2 + c_h^2} + \dfrac{4}{\pi^2}\left(\tan^{-1}\dfrac{w^{gt}}{h^{gt}} - \tan^{-1}\dfrac{w}{h}\right)^2$
The intersection over union (IoU) is defined as the proportion of the overlapping region relative to the combined area between the forecasted bounding box and the ground truth box. The parameters involved in this formula are further elucidated in Figure 8. Specifically, $\rho(b, b^{gt})$ denotes the Euclidean distance between the centroids of the forecasted and ground truth boxes. Moreover, $h$ and $w$ represent the height and width of the forecasted box, respectively. On the other hand, $h^{gt}$ and $w^{gt}$ signify the ground truth height and width of the frame, respectively. $c_h$ and $c_w$ represent the height and width of the smallest bounding box that encompasses both the forecasted and ground truth boxes.
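For reference, a direct implementation of the CIoU loss of Equations (21)–(23) might look as follows; this sketch follows the form given above and omits the trade-off weight used in some CIoU implementations.

```python
import math
import torch

def ciou_loss(pred, target, eps=1e-7):
    """CIoU loss of Equations (21)-(23). Boxes are (N, 4) tensors in (x1, y1, x2, y2)."""
    # IoU of the predicted and ground truth boxes
    inter_w = (torch.min(pred[:, 2], target[:, 2]) - torch.max(pred[:, 0], target[:, 0])).clamp(0)
    inter_h = (torch.min(pred[:, 3], target[:, 3]) - torch.max(pred[:, 1], target[:, 1])).clamp(0)
    inter = inter_w * inter_h
    area_p = (pred[:, 2] - pred[:, 0]) * (pred[:, 3] - pred[:, 1])
    area_t = (target[:, 2] - target[:, 0]) * (target[:, 3] - target[:, 1])
    iou = inter / (area_p + area_t - inter + eps)

    # Squared centre distance over the enclosing box diagonal (c_w^2 + c_h^2)
    cx_p, cy_p = (pred[:, 0] + pred[:, 2]) / 2, (pred[:, 1] + pred[:, 3]) / 2
    cx_t, cy_t = (target[:, 0] + target[:, 2]) / 2, (target[:, 1] + target[:, 3]) / 2
    rho2 = (cx_p - cx_t) ** 2 + (cy_p - cy_t) ** 2
    c_w = torch.max(pred[:, 2], target[:, 2]) - torch.min(pred[:, 0], target[:, 0])
    c_h = torch.max(pred[:, 3], target[:, 3]) - torch.min(pred[:, 1], target[:, 1])

    # Aspect-ratio consistency term
    w_p, h_p = pred[:, 2] - pred[:, 0], pred[:, 3] - pred[:, 1]
    w_t, h_t = target[:, 2] - target[:, 0], target[:, 3] - target[:, 1]
    v = (4 / math.pi ** 2) * (torch.atan(w_t / (h_t + eps)) - torch.atan(w_p / (h_p + eps))) ** 2

    r_ciou = rho2 / (c_w ** 2 + c_h ** 2 + eps) + v
    return 1 - iou + r_ciou       # Equation (21)

pred = torch.tensor([[0., 0., 10., 10.]])
target = torch.tensor([[2., 2., 12., 14.]])
print(ciou_loss(pred, target))
```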
Acknowledging the shortcomings of CIoU in the context of waste classification, our investigation proposes the adoption of MPDIoU [42] as an alternative loss function. The newly applied MPDIoU loss function aims to refine the efficacy and precision of bounding box regression. It addresses both intersecting and non-overlapping bounding box regression challenges. It incorporates considerations for center point distances and discrepancies in dimensions. This is achieved by employing a similarity metric for the bounding boxes grounded on the minimal point distance. The adoption of MPDIoU simplifies the computational framework, thereby accelerating the convergence of the model and enhancing the precision of the regression outcomes. The architecture of the revised loss function is depicted in Figure 9.
Within Figure 9, entities A and B signify the predicted and ground truth bounding boxes, respectively. The coordinates for the top-left and bottom-right vertices of bounding box A are denoted by $(x_1^A, y_1^A)$ and $(x_2^A, y_2^A)$, while $(x_1^B, y_1^B)$ and $(x_2^B, y_2^B)$ represent the corresponding coordinates for bounding box B. The variables $d_1$ and $d_2$ are utilized to delineate the spatial separations between the top-left and bottom-right vertices of the actual and forecasted bounding boxes, in that order. The computations for $d_1$ and $d_2$ are facilitated by the application of Equations (24) and (25).
From the corresponding corner points of the predicted box A and the ground truth box B, the sums of the squared differences of the x and y coordinates are obtained, as follows.
$d_1^2 = (x_1^B - x_1^A)^2 + (y_1^B - y_1^A)^2$
$d_2^2 = (x_2^B - x_2^A)^2 + (y_2^B - y_2^A)^2$
Subsequently, $L_{MPDIoU}$ can be derived from $d_1$ and $d_2$, as expressed by Equations (26) and (27):
$L_{MPDIoU} = 1 - MPDIoU$
$MPDIoU = IoU - \dfrac{d_1^2}{w^2 + h^2} - \dfrac{d_2^2}{w^2 + h^2}$
The calculation of IoU is shown in Equation (28).
$IoU = \dfrac{|A \cap B|}{|A \cup B|}$
Compared to standard IoU, MPDIoU adds a penalty for differences in box sizes, which helps achieve more precise bounding box regression.
Through the derivation of the parameter values of Equations (24)–(28), the calculation framework of MPDIoU is simplified, the convergence speed of the model is accelerated, and the accuracy of the regression results is improved.
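A corresponding sketch of the MPDIoU loss of Equations (24)–(28) is shown below; following the original MPDIoU formulation [42], w and h are taken here to be the width and height of the input image used to normalise the corner distances.

```python
import torch

def mpdiou_loss(pred, target, img_w, img_h, eps=1e-7):
    """MPDIoU loss of Equations (24)-(28).

    pred, target: (N, 4) boxes in (x1, y1, x2, y2) format.
    img_w, img_h: width and height used to normalise the corner distances.
    """
    inter_w = (torch.min(pred[:, 2], target[:, 2]) - torch.max(pred[:, 0], target[:, 0])).clamp(0)
    inter_h = (torch.min(pred[:, 3], target[:, 3]) - torch.max(pred[:, 1], target[:, 1])).clamp(0)
    inter = inter_w * inter_h
    area_p = (pred[:, 2] - pred[:, 0]) * (pred[:, 3] - pred[:, 1])
    area_t = (target[:, 2] - target[:, 0]) * (target[:, 3] - target[:, 1])
    iou = inter / (area_p + area_t - inter + eps)                                  # Eq. (28)

    d1_sq = (target[:, 0] - pred[:, 0]) ** 2 + (target[:, 1] - pred[:, 1]) ** 2    # Eq. (24)
    d2_sq = (target[:, 2] - pred[:, 2]) ** 2 + (target[:, 3] - pred[:, 3]) ** 2    # Eq. (25)

    mpdiou = iou - d1_sq / (img_w ** 2 + img_h ** 2) - d2_sq / (img_w ** 2 + img_h ** 2)  # Eq. (27)
    return 1 - mpdiou                                                              # Eq. (26)

pred = torch.tensor([[0., 0., 10., 10.]])
target = torch.tensor([[2., 2., 12., 14.]])
print(mpdiou_loss(pred, target, img_w=640, img_h=640))
```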
For comparison, the calculation method of GIoU is shown in Equation (29).
$GIoU = IoU - \dfrac{|C \setminus (A \cup B)|}{|C|}$
In Equation (29), A represents the predicted box, B represents the ground truth box, and C represents the smallest enclosing box covering both A and B.
MPDIoU focuses on the difference in the perimeters of the boxes, while GIoU considers the smallest enclosing rectangle covering both boxes.
For comparison, the calculation method of DIoU is shown in Equation (30).
$DIoU = IoU - \dfrac{\rho^2(b, b^{gt})}{c^2}$
In Equation (30), IoU represents the traditional intersection over union. $\rho^2(b, b^{gt})$ denotes the squared Euclidean distance between the center points of the predicted box and the ground truth box, $b$ represents the center point of the predicted box, while $b^{gt}$ represents the center point of the ground truth box, and $c$ denotes the diagonal length of the smallest enclosing rectangle that covers both the predicted box and the ground truth box.
MPDIoU employs the minimum perimeter distance instead of the center point distance, which may be more sensitive to targets of different shapes.
Gradient analysis of MPDIoU is shown in Equation (31):
$\dfrac{\partial MPDIoU}{\partial P} = \dfrac{\partial IoU}{\partial P} - \alpha \left( \dfrac{\dfrac{\partial MPD}{\partial P} \cdot c - MPD \cdot \dfrac{\partial c}{\partial P}}{c^2} \right)$
In Equation (31), $\partial MPDIoU / \partial P$ represents the partial derivative of the MPDIoU loss with respect to the predicted box P. $\partial IoU / \partial P$ represents the partial derivative of IoU with respect to the predicted box P. α represents the balancing parameter used to adjust the relative importance of the IoU term and the MPD term. MPD represents the minimum perimeter distance. c represents the distance between the center points of the predicted box and the ground truth box. $\partial MPD / \partial P$ represents the partial derivative of MPD with respect to the predicted box P, and $\partial c / \partial P$ represents the partial derivative of c with respect to the predicted box P.
This gradient expression shows how MPDIoU simultaneously considers changes in overlap and shape differences. Theoretically, MPDIoU offers several advantages. It is more sensitive to differences in box sizes, enabling fine adjustments. Its computation is relatively simple, potentially offering better optimization efficiency. It combines the benefits of IoU with the considerations of shape.

4. Experiments

4.1. Dataset

In this research, the model under consideration was trained and assessed utilizing the waste image dataset from the “Huawei Cloud” Garbage Classification Competition. The HUAWEI-40 garbage categorization challenge dataset comprises 14,964 images, annotated with 44 types of labels, including disposable lunch boxes, book paper, power banks, leftover food, bags, trash bins, and so on. All images in the dataset were collected via mobile phones from people’s daily lives. In this experiment, the “Huawei Cloud” dataset was segmented into 10,474, 2992, and 1498 images for training, testing, and validation, respectively. A sample of images from the dataset is shown in Figure 10.
The division of samples into training, test, and validation sets occurred during the model’s training phase, with the allocation following a 7:2:1 ratio for the training, test, and validation sets, respectively. In the training set, Figure 11 illustrates the distribution of data across 44 garbage dataset categories, along with the specifics of the label boxes.

4.2. Experimental Platform and Evaluation Criteria

The experiment was performed on a system running Ubuntu 20.04, employing Pytorch 1.11.0, Python 3.8.10, and CUDA 11.3 for its operational framework. The infrastructure for model training was supported by RTX 3090 GPUs. Throughout the experiment, uniform hyperparameters were applied across the training, validation, and testing phases to ensure consistency. The specified parameters included a training epoch count of 200, a batch size of 64, and an image resolution of 640 × 640 pixels. Notably, the training process proceeded without the application of pre-trained weights to the model.
The methodology for evaluating the experimental outcomes hinged on the cross-validation technique. After the phases of training and validation against designated datasets, the model underwent a conclusive performance appraisal utilizing the test dataset. In this comprehensive evaluation, the network’s performance was gauged using four pivotal metrics: precision (P), recall (R), model size, and mean average precision (mAP). To assess these metrics accurately, it was imperative to employ parameters like TP (true positive, reflecting accurate positive identifications), FP (false positive, signifying erroneous positive identifications), and FN (false negative, indicating incorrect negative identifications). Additionally, the intersection over union (IoU) metric, the ratio of the overlap between the forecasted bounding boxes and the ground truth to their union, was utilized to quantitatively assess their agreement. The precision metric is calculated as the fraction of instances flagged as positive by the model that are actually positive, as delineated in Equation (32).
Substituting TP and FP into Equation (32) yields the precision value.
$\mathrm{Precision} = \dfrac{TP}{TP + FP}$
Recall, as a metric, quantifies the fraction of positively labeled instances that are correctly identified by the model out of the total population of actual positive samples. The computation of recall is presented in Equation (33).
Substituting TP and FN into Equation (33) yields the recall value.
$\mathrm{Recall} = \dfrac{TP}{TP + FN}$
Average precision (AP) is defined as the area under the precision-recall curve. The computation of this metric is delineated in Equation (34), providing a quantitative measure of the network’s performance. The precision and recall values obtained from Equations (32) and (33) are substituted into Equation (34) to obtain the average precision (AP).
$AP = \int_0^1 \mathrm{Precision}(\mathrm{Recall})\, d(\mathrm{Recall})$
The metric of mean average precision (mAP) serves as a quantifier for model detection efficacy across various categories. It is calculated as the mean of the average precision (AP) values for each category. This calculation methodology is encapsulated in Equation (35), providing a thorough evaluation of the model’s effectiveness in detecting diverse categories. The per-category AP values obtained from Equation (34) are averaged to obtain the mean average precision (mAP).
$mAP = \dfrac{1}{N} \sum_{i=1}^{N} AP_i$
Within Equation (35), the term $AP_i$ denotes the average precision of the category with index i. N signifies the number of sample categories within the training dataset, which, for the purposes of this study, is 44. The notation mAP0.5 is utilized to describe the mean average precision of the detection model at an intersection over union (IoU) threshold of 0.5. Conversely, mAP0.5:0.95 is employed to articulate the mean average precision across an IoU threshold range from 0.5 to 0.95, in increments of 0.05.
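The evaluation metrics of Equations (32)–(35) can be reproduced with a few lines of NumPy; the numbers below are purely illustrative and are not measurements from the experiments.

```python
import numpy as np

def precision_recall(tp, fp, fn):
    """Equations (32) and (33)."""
    return tp / (tp + fp), tp / (tp + fn)

def average_precision(precisions, recalls):
    """Equation (34): area under the precision-recall curve,
    numerically integrated over sampled operating points."""
    order = np.argsort(recalls)
    r, p = np.asarray(recalls)[order], np.asarray(precisions)[order]
    return np.trapz(p, r)

def mean_average_precision(ap_per_class):
    """Equation (35): unweighted mean of per-class AP over N categories."""
    return float(np.mean(ap_per_class))

# Illustrative numbers only
p, r = precision_recall(tp=80, fp=10, fn=20)
print(p, r)                                   # 0.888..., 0.8
aps = [average_precision([1.0, 0.9, 0.7], [0.1, 0.5, 0.9]) for _ in range(44)]
print(mean_average_precision(aps))
```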
To better evaluate model lightweighting, we introduced the metrics of GFLOPs and model parameters. GFLOPs serves as a metric for evaluating the computational complexity of a model or algorithm, whereas the parameter count denotes the model’s size.

4.3. Experimental Result Analysis

4.3.1. Before and After Improvement

During the experimental phase, the efficacy of the novel model was assessed utilizing the “Huawei Cloud” datasets. This evaluation, juxtaposed with the outcomes of the YOLOv8s model, elucidates the superior capability of the proposed algorithm in the realm of waste detection. The enhanced performance of the proposed model in comparison to that of YOLOv8s is systematically documented in Table 1 and Table 2. These metrics are based on the results from 200 epochs of experiments. These tables delineate the performance metrics for the “Huawei Cloud” test and validation datasets, respectively. The findings indicate that the introduced algorithm surpasses the performance of the YOLOv8s model in terms of efficacy.
In the test dataset, the proposed algorithm achieved a 4.80% improvement in precision (P) and a 1.27% improvement in mean average precision at an IoU threshold of 0.5 (mAP@0.5). Additionally, the model size is smaller than that of YOLOv8s. Compared to the original model, there was a reduction of 0.1 GFLOPs and a decrease in the number of parameters by 0.73 million, as shown in Table 1.
In the validation dataset, compared to the original model, the proposed algorithm achieved a 4.80% increase in P and a 0.5% increase in mAP@0.5, as shown in Table 2. The new model has shown significant performance improvements using the “Huawei Cloud” dataset, indicating its effectiveness in garbage classification detection. Our method significantly improves performance in terms of model lightness and accuracy.
To more precisely evaluate the model’s performance, we generated the PR curves for the model at an intersection over union (IoU) threshold of 0.5, both prior to and following the improvements during the testing phase, as depicted in Figure 12 and Figure 13.
The area under the curve (AUC-PR), a frequently utilized metric to evaluate model performance, signifies that a larger AUC-PR corresponds to superior performance across diverse precision–recall combinations. The enhanced model clearly demonstrates a higher AUC-PR.

4.3.2. Ablation Experiment

To substantiate the efficiency of the suggested algorithm, an evaluative ablation study was performed utilizing the “Huawei Cloud” dataset. The initial model employed was YOLOv8s [36]. Various enhancement techniques described herein were incrementally incorporated into this model, either singly or in combination, to assess each method’s contribution to object detection performance.
Table 3 presents the results of an ablation study conducted on the “Huawei Cloud” datasets. This experiment involved implementing several enhancements into the foundational YOLOv8s framework. These enhancements included replacing the original backbone with CG-HGNetV2 (as detailed in Table 3 (+CG-HG)), integrating the MSE-AKConv attention module (as detailed in Table 3 (+MSE-AK)), and implementing MPDIoU (as detailed in Table 3 (+MPDIoU)).
To provide a more intuitive understanding of the significance of each improvement method, we offer the following visual representations, as depicted in Figure 14.
Based on the data from Table 3 and Figure 14, it is evident that the integration of lightweight improvements into the YOLOv8s model resulted in a reduction of 6.55% in regards to the model parameter count and 0.03% of computational costs, while maintaining good detection performance. The experimental data also shows that after improvement by CG-HGNetV2, there was a 7.63% reduction in the parameters, with a minimal decrease in accuracy, attributed to the hierarchical feature extraction approach utilized by the structure. The CGB structure employs channel-wise convolutions in the local feature extractor and context extractor to reduce inter-channel computational costs and conserve memory, resulting in a substantial reduction in the parameter count and computational cost, while effectively mitigating accuracy loss.
Furthermore, this study introduces the innovative MSE-AKConv design, which reduces traditional convolution operations by adjusting the initial sampling shape through learned displacements, enhancing adaptability to target variations. This operation dynamically adjusts the convolution kernel’s sampling shape during training, effectively improving detection accuracy and robustness, while reducing sensitivity to target variations and enhancing garbage detection capability. These improvements result in a 1.24% increase in mAP compared to that of the original YOLOv8 model [43].
Table 4 details the per-class detection precision for selected objects in the “Huawei Cloud” test dataset. The results show that the proposed algorithm substantially raises detection precision for a variety of targets in previously unencountered scenarios. Gains were especially large for beverage cans, bags, and old clothes, at 27.20%, 26.40%, and 21.80%, respectively. Although integrating the new modules slightly reduces precision for a few object classes, the cumulative effect on overall detection precision is unequivocally positive.
To demonstrate the effectiveness of this study graphically, the visualization results of the ablation experiment, conducted across the various models under identical conditions, are shown in Figure 15. The image comparison makes clear that the YOLOv8s model produces both missed and false detections. Adding or improving modules on top of the original YOLOv8s enhances detection performance relative to YOLOv8s, significantly improving recall and precision and increasing the certainty of detection for each target.

4.3.3. Mainstream Model Comparison Experiments

During the experimental phase, the efficacy of the novel model was assessed on the “Huawei Cloud” datasets. Comparison with the YOLOv8s model demonstrates the superior waste detection capability of the proposed algorithm. Its performance relative to YOLOv3-tiny [24], YOLOv5s [28], YOLOv6s [44], YOLOv7-tiny [30], and YOLOv8s [36] is documented in Table 5 and Table 6, which report the metrics for the “Huawei Cloud” test and validation datasets, respectively.
To present the comparison more intuitively, the results are illustrated in Figure 16.
Analysis of the data in Table 5, Table 6, and Figure 16 shows that, among the compared models, YOLOv5 and YOLOv6 exhibit relatively low detection accuracy while also carrying larger parameter counts and higher computational costs, requiring more computing resources and time for training and inference. YOLOv3-tiny, as a lightweight model, demands far less computation, but its detection performance is unsatisfactory, sacrificing a degree of perceptual capability. The model presented in this study combines a lightweight design with stronger feature fusion and extraction, and the algorithm balances detection speed and accuracy, achieving the highest mean average precision (mAP) while using fewer computational resources.
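The Params (M) and GFLOPs columns reported in these comparisons can be reproduced for the baseline with standard tooling; the sketch below assumes the ultralytics package and the stock yolov8s.pt weights, and the same calls would apply to a custom model loaded from its own configuration.

```python
# Sketch: reproducing the Params (M) and GFLOPs columns for a baseline model.
# Assumes the ultralytics package and the stock yolov8s.pt weights; a custom
# enhanced model would be loaded from its own .yaml/.pt in the same way.
from ultralytics import YOLO

model = YOLO("yolov8s.pt")
model.info()                       # prints layer count, parameters, and GFLOPs
n_params = sum(p.numel() for p in model.model.parameters())
print(f"{n_params / 1e6:.2f} M parameters")
```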
For a more intuitive comparison of the detection results, detections on part of the dataset are shown in Figure 17. The experiments were conducted under identical conditions to compare the enhanced YOLOv8 model with the YOLOv3-tiny [24], YOLOv5s [28], YOLOv6s [44], YOLOv7-tiny [30], and YOLOv8s models.
The image comparison shows that the other mainstream models exhibit missed and false detections. In contrast, our improved model achieves higher detection accuracy than the YOLOv8s model, with markedly fewer missed and false detections. This notably enhances both recall and precision, increasing the certainty of detection for each target.

5. Conclusions

This study has improved the existing YOLOv8s architecture, making the revised waste identification model suitable for deployment on edge devices. The results from various experiments indicate that this approach not only achieves higher accuracy, but also incurs lower operational costs. Integrating CG-HGNetV2 as the backbone network significantly reduces the model’s parameters, and incorporating the MSE-AKConv attention module within the convolutional layers greatly improves the model’s accuracy. To further enhance regression accuracy, we adopted the MPDIoU loss function, which has also accelerated the convergence of the network.
The experimental findings reveal a 4.80% improvement in garbage detection precision, a 0.10% improvement in recall, and a 1.30% improvement in mAP@0.5 over the YOLOv8s model, achieved while reducing model parameters by 6.55% and computational demand by 0.03% (0.1 GFLOPs). This makes the model suitable for high-accuracy applications in environments constrained by memory capacity and computational power, such as embedded systems. When benchmarked against alternative models, our approach also exhibits superior detection capability. This provides an important research foundation for urban environmental management and the achievement of sustainable development goals.
This study also has limitations. Although the model performs well in resource-constrained environments, its robustness in complex or changing scenarios has not been fully validated, and its performance may vary across garbage types, especially for objects with diverse shapes or heavy occlusion. Despite the reductions in parameters and computational demand already achieved, future work will focus on further algorithm optimization to improve efficiency, reduce energy consumption, and increase processing speed.

Author Contributions

Conceptualization, P.L. and J.X.; methodology, P.L.; software, P.L. and J.X.; validation, P.L., J.X. and S.L.; formal analysis, P.L.; investigation, J.X.; resources, S.L.; data curation, P.L.; writing—original draft preparation, P.L. and J.X.; writing—review and editing, P.L., J.X. and S.L.; visualization, P.L. and J.X.; supervision, S.L.; project administration, P.L., J.X. and S.L.; funding acquisition, S.L. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The data used in this study can be obtained from the corresponding authors.

Acknowledgments

The authors want to thank the editor and anonymous reviewers for their valuable suggestions for improving this paper.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Mohee, R.; Mauthoor, S.; Bundhoo, Z.M.; Somaroo, G.; Soobhany, N.; Gunasee, S. Current status of solid waste management in small island developing states: A review. Waste Manag. 2015, 43, 539–549. [Google Scholar] [CrossRef] [PubMed]
  2. Grazhdani, D. Assessing the variables affecting on the rate of solid waste generation and recycling: An empirical analysis in Prespa Park. Waste Manag. 2016, 48, 3–13. [Google Scholar] [CrossRef] [PubMed]
  3. Alzamora, B.R.; Barros, R.T.d.V. Review of municipal waste management charging methods in different countries. Waste Manag. 2020, 115, 47–55. [Google Scholar] [CrossRef] [PubMed]
  4. Zaman, A.; Ahsan, T. Zero-Waste: Reconsidering Waste Management for the Future; Routledge: London, UK, 2019. [Google Scholar]
  5. Tong, Y.; Liu, J.; Liu, S. China is implementing “Garbage Classification” action. Environ. Pollut. 2020, 259, 113707. [Google Scholar] [CrossRef] [PubMed]
  6. Namen, A.A.; da Costa Brasil, F.; Abrunhosa, J.J.G.; Abrunhosa, G.G.S.; Tarré, R.M.; Marques, F.J.G. RFID technology for hazardous waste management and tracking. Waste Manag. Res. 2014, 32, 59–66. [Google Scholar] [CrossRef] [PubMed]
  7. Chandra, S.S.; Kulshreshtha, M.; Randhawa, P. Garbage detection and path-planning in autonomous robots. In Proceedings of the 2021 9th International Conference on Reliability, Infocom Technologies and Optimization (Trends and Future Directions) (ICRITO), Noida, India, 3–4 September 2021; pp. 1–4. [Google Scholar]
  8. Sarker, N.; Chaki, S.; Das, A.; Forhad, M.S.A. Illegal trash thrower detection based on HOGSVM for a real-time monitoring system. In Proceedings of the 2021 2nd International Conference on Robotics, Electrical and Signal Processing Techniques (ICREST), Dhaka, Bangladesh, 5–7 January 2021; pp. 483–487. [Google Scholar]
  9. Cai, X.; Shuang, F.; Sun, X.; Duan, Y.; Cheng, G. Towards lightweight neural networks for garbage object detection. Sensors 2022, 22, 7455. [Google Scholar] [CrossRef] [PubMed]
  10. Jian-ye, Y.; Xin-yuan, N.; Xin, C. Garbage image classification by lightweight residual network. Environ. Eng. 2021, 39, 110–115. [Google Scholar]
  11. Liu, Y.; Ge, Z.; Lv, G.; Wang, S. Research on automatic garbage detection system based on deep learning and narrowband internet of things. J. Phys. Conf. Ser. 2018, 1069, 012032. [Google Scholar] [CrossRef]
  12. Wang, Y.; Zhang, X. Autonomous garbage detection for intelligent urban management. In MATEC Web of Conferences; EDP Sciences: Ulis, France, 2018; p. 01056. [Google Scholar]
  13. Fu, B.; Li, S.; Wei, J.; Li, Q.; Wang, Q.; Tu, J. A novel intelligent garbage classification system based on deep learning and an embedded linux system. IEEE Access 2021, 9, 131134–131146. [Google Scholar] [CrossRef]
  14. Chen, Z.; Jiao, H.; Yang, J.; Zeng, H. Garbage image classification algorithm based on improved MobileNet v2. J. Zhejiang Univ. 2021, 11, 1490–1499. [Google Scholar]
  15. Feng, J.; Tang, X.; Jiang, X.; Chen, Q. Garbage disposal of complex background based on deep learning with limited hardware resources. IEEE Sens. J. 2021, 21, 21050–21058. [Google Scholar] [CrossRef]
  16. Kang, Z.; Yang, J.; Li, G.; Zhang, Z. An automatic garbage classification system based on deep learning. IEEE Access 2020, 8, 140019–140029. [Google Scholar] [CrossRef]
  17. Gupta, T.; Joshi, R.; Mukhopadhyay, D.; Sachdeva, K.; Jain, N.; Virmani, D.; Garcia-Hernandez, L.; Systems, I. A deep learning approach based hardware solution to categorise garbage in environment. Complex Intell. Syst. 2022, 8, 1129–1152. [Google Scholar] [CrossRef]
  18. Shi, C.; Xia, R.; Wang, L. A novel multi-branch channel expansion network for garbage image classification. IEEE Access 2020, 8, 154436–154452. [Google Scholar] [CrossRef]
  19. Shen, C. A transdisciplinary review of deep learning research and its relevance for water resources scientists. Water Resour. Res. 2018, 54, 8558–8593. [Google Scholar] [CrossRef]
  20. Girshick, R. Fast r-cnn. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015; pp. 1440–1448. [Google Scholar]
  21. Dai, J.; Li, Y.; He, K.; Sun, J. R-fcn: Object detection via region-based fully convolutional networks. Adv. Neural Inf. Process. Syst. 2016, 29, 379–387. [Google Scholar]
  22. He, K.; Gkioxari, G.; Dollár, P.; Girshick, R. Mask r-cnn. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2961–2969. [Google Scholar]
  23. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster r-cnn: Towards real-time object detection with region proposal networks. Adv. Neural Inf. Process. Syst. 2015, 28, 91–99. [Google Scholar] [CrossRef]
  24. Redmon, J.; Farhadi, A. Yolov3: An incremental improvement. arXiv 2018, arXiv:02767. [Google Scholar]
  25. Bochkovskiy, A.; Wang, C.-Y.; Liao, H.-Y.M. Yolov4: Optimal speed and accuracy of object detection. arXiv 2020, arXiv:10934. [Google Scholar]
  26. Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You only look once: Unified, real-time object detection. In Proceedings of the IEEE Conference on Computer vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 779–788. [Google Scholar]
  27. Redmon, J.; Farhadi, A. YOLO9000: Better, faster, stronger. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 7263–7271. [Google Scholar]
  28. Zhu, X.; Lyu, S.; Wang, X.; Zhao, Q. TPH-YOLOv5: Improved YOLOv5 based on transformer prediction head for object detection on drone-captured scenarios. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, BC, Canada, 11–17 October 2021; pp. 2778–2788. [Google Scholar]
  29. Li, C.; Li, L.; Jiang, H.; Weng, K.; Geng, Y.; Li, L.; Ke, Z.; Li, Q.; Cheng, M.; Nie, W. YOLOv6: A single-stage object detection framework for industrial applications. arXiv 2022, arXiv:02976. [Google Scholar]
  30. Wang, C.-Y.; Bochkovskiy, A.; Liao, H.-Y.M. YOLOv7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 7464–7475. [Google Scholar]
  31. Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.-Y.; Berg, A.C. Ssd: Single shot multibox detector. In Proceedings of the Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, 11–14 October 2016; Proceedings, Part I 14. pp. 21–37. [Google Scholar]
  32. Lin, T.-Y.; Goyal, P.; Girshick, R.; He, K.; Dollár, P. Focal loss for dense object detection. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2980–2988. [Google Scholar]
  33. Lv, W.; Xu, S.; Zhao, Y.; Wang, G.; Wei, J.; Cui, C.; Du, Y.; Dang, Q.; Liu, Y. Detrs beat yolos on real-time object detection. arXiv 2023, arXiv:08069. [Google Scholar]
  34. Zhang, X.; Song, Y.; Song, T.; Yang, D.; Ye, Y.; Zhou, J.; Zhang, L. AKConv: Convolutional Kernel with Arbitrary Sampled Shapes and Arbitrary Number of Parameters. arXiv 2023, arXiv:11587. [Google Scholar]
  35. Zheng, Z.; Wang, P.; Ren, D.; Liu, W.; Ye, R.; Hu, Q.; Zuo, W. Enhancing geometric factors in model learning and inference for object detection and instance segmentation. IEEE Trans. Cybern. 2021, 52, 8574–8586. [Google Scholar] [CrossRef]
  36. Li, S.; Shi, T.; Jing, F.K. Improved Road Damage Detection Algorithm of YOLOv8. Comput. Eng. Appl. 2023, 59, 165–174. [Google Scholar]
  37. Liu, S.; Qi, L.; Qin, H.; Shi, J.; Jia, J. Path aggregation network for instance segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 8759–8768. [Google Scholar]
  38. Feng, C.; Zhong, Y.; Gao, Y.; Scott, M.R.; Huang, W. Tood: Task-aligned one-stage object detection. In Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, BC, Canada, 11–17 October 2021; pp. 3490–3499. [Google Scholar]
  39. Wu, T.; Tang, S.; Zhang, R.; Cao, J.; Zhang, Y. Cgnet: A light-weight context guided network for semantic segmentation. IEEE Trans. Image Process. 2020, 30, 1169–1179. [Google Scholar] [CrossRef]
  40. Misra, D. Mish: A self regularized non-monotonic activation function. arXiv 2019, arXiv:08681. [Google Scholar]
  41. Hu, J.; Shen, L.; Sun, G. Squeeze-and-excitation networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 7132–7141. [Google Scholar]
  42. Siliang, M.; Yong, X. Mpdiou: A loss for efficient and accurate bounding box regression. arXiv 2023, arXiv:07662. [Google Scholar]
  43. Ying, J.D.; Xiao, F.L.; Yu, W.Y.; Kun, W. Optimizing Road Safety: Advancements in Lightweight YOLOv8 Models and GhostC2f Design for Real-Time Distracted Driving Detection. Sensors 2023, 23, 8844. [Google Scholar] [CrossRef]
  44. Zhao, Z.B.; Feng, S.; Zhao, W.Q.; Zhai, Y.J.; Wang, H.T. Fusion of Knowledge transfer and Improved YOLOv6 Thermal image Detection Method for substation equipment. J. Intell. Syst. 2019, 18, 1213–1222. [Google Scholar]
Figure 1. Block diagram of YOLOv8s.
Figure 2. The structure of the algorithm network discussed in this paper.
Figure 3. The process structure of CG-HGNetV2.
Figure 4. The structure of the context guided block.
Figure 5. The structure of the HG-block.
Figure 6. The detailed schematic of the structure of MSE-AKConv.
Figure 7. The detailed schematic of the structure of SEnet.
Figure 8. Diagram of loss function parameters.
Figure 9. Diagram of the MPDIoU loss function.
Figure 10. Partial sample images of the “Huawei Cloud” datasets.
Figure 11. Label data volume and label distribution of garbage categories.
Figure 12. PR diagram of YOLOv8s.
Figure 13. PR diagram of our model.
Figure 14. PR diagram of our model using the test dataset.
Figure 15. Visualization of detection results from ablation experiments tested on data from the “Huawei Cloud” datasets. (a) Ointment. (b) Toothpicks. (c) Big bones. (d) Wok. (e) Chopping block. (f) Peel and pulp.
Figure 16. Comparison results of several mainstream models using the test dataset.
Figure 17. Visualization of comparative experimental detection results from a portion of the data obtained from the “Huawei Cloud” datasets. (a) Glass containers. (b) Chopping block. (c) Ointment. (d) Toothpicks. (e) Big bones. (f) Peel and pulp.
Table 1. Comparison of the detection effect between YOLOv8s and our model on the test dataset.

Models | P (%) | R (%) | Params (M) | GFLOPs | mAP@0.5 (%)
YOLOv8s | 74.20 | 62.70 | 11.14 | 28.5 | 69.6
Enhanced YOLOv8 | 79.00 | 62.80 | 10.41 | 28.4 | 70.9
Table 2. Comparison of the detection effect between YOLOv8s and our model on the validation dataset.

Models | P (%) | R (%) | Params (M) | GFLOPs | mAP@0.5 (%)
YOLOv8s | 72.60 | 62.40 | 11.14 | 28.5 | 68.0
Enhanced YOLOv8 | 77.40 | 62.60 | 10.41 | 28.4 | 68.5
Table 3. Ablation experiment using the “Huawei Cloud” test dataset.

Models | P (%) | R (%) | Params (M) | GFLOPs | mAP@0.5 (%) | mAP@0.5:0.95 (%)
YOLOv8s | 74.20 | 62.70 | 11.14 | 28.5 | 69.61 | 58.83
+CG-HG | 79.60 | 59.30 | 10.29 | 28.3 | 69.08 | 57.80
+MSE-AK | 77.00 | 62.20 | 12.80 | 29.9 | 70.85 | 60.24
+MPDIoU | 78.00 | 59.60 | 11.14 | 28.5 | 69.13 | 58.68
+CG-HG+MPDIoU | 73.90 | 63.60 | 10.29 | 28.3 | 69.15 | 57.88
+CG-HG+MSE-AK | 80.30 | 59.50 | 10.41 | 28.4 | 69.55 | 58.32
+CG-HG+MSE-AK+MPDIoU | 79.00 | 62.80 | 10.41 | 28.4 | 70.88 | 60.35
Table 4. Precision (%) of the detection of various objects in the “Huawei Cloud” test dataset.

Class | YOLOv8s | +CG-HG | +MSE-AK | +MPDIoU | +CG-HG+MPDIoU | +CG-HG+MSE-AK | +CG-HG+MSE-AK+MPDIoU
Book paper | 54.80 | 83.60 | 56.90 | 63.60 | 65.80 | 76.70 | 72.50
Leftover food | 82.30 | 95.00 | 87.90 | 88.10 | 85.50 | 87.80 | 88.40
Bag | 68.00 | 86.20 | 85.30 | 76.50 | 90.50 | 89.70 | 94.40
Trash can | 73.30 | 97.70 | 94.30 | 64.10 | 81.40 | 87.10 | 87.70
Plastic utensils | 78.00 | 87.20 | 86.10 | 92.50 | 83.90 | 87.10 | 89.70
Plastic toys | 79.60 | 65.30 | 78.00 | 67.20 | 66.80 | 77.50 | 81.70
Express delivery paper bag | 77.00 | 84.20 | 91.10 | 93.50 | 80.40 | 83.70 | 86.10
Plugs and wires | 60.30 | 71.60 | 66.90 | 61.60 | 63.70 | 72.40 | 73.60
Old clothes | 64.30 | 93.50 | 73.80 | 69.30 | 83.50 | 82.90 | 86.10
Beverage cans | 67.00 | 91.70 | 76.10 | 88.20 | 79.00 | 79.30 | 94.20
Pillows | 76.90 | 89.90 | 88.40 | 88.40 | 80.50 | 83.10 | 83.70
Peel and pulp | 80.30 | 71.60 | 83.20 | 85.20 | 81.40 | 89.30 | 91.10
Cigarette butts | 84.80 | 94.40 | 91.10 | 88.90 | 82.30 | 92.20 | 90.10
Toothpicks | 60.20 | 93.80 | 58.80 | 64.60 | 55.70 | 100.00 | 62.00
Glass containers | 73.70 | 75.20 | 71.10 | 75.50 | 68.20 | 77.80 | 77.80
Chopsticks | 71.20 | 79.50 | 75.00 | 74.20 | 72.10 | 80.20 | 81.40
Cartons | 85.40 | 89.90 | 83.00 |  | 84.80 | 88.10 | 95.00
Tea leaves | 84.90 | 86.80 | 89.70 | 88.20 | 90.50 | 88.10 | 90.00
Vegetables stalks and leaves | 79.30 | 88.10 | 81.90 | 81.40 | 79.40 | 87.80 | 81.80
Eggshells | 72.10 | 68.50 | 70.60 | 71.40 | 73.70 | 74.90 | 80.50
Spice bottles | 78.90 | 86.80 | 87.30 | 89.70 | 81.70 | 82.90 | 90.10
Wine bottles | 49.30 | 57.00 | 56.60 | 54.70 | 50.40 | 65.70 | 61.90
Metal utensils | 31.20 | 30.30 | 36.60 | 63.60 | 44.30 | 45.00 | 44.50
Wok | 73.20 | 74.00 | 82.00 | 80.60 | 75.10 | 77.40 | 79.10
Ceramic ware | 70.60 | 74.50 | 71.80 | 71.00 | 73.90 | 74.40 | 75.60
Beverage bottles | 67.00 | 79.20 | 80.20 | 64.90 | 58.30 | 66.50 | 74.90
Fish bones | 83.30 | 86.50 | 77.30 | 85.00 | 83.10 | 89.20 | 87.70
Table 5. The detection performance of other mainstream algorithms and the enhanced YOLOv8 model is compared using the test dataset.

Models | P (%) | R (%) | Params (M) | GFLOPs | mAP@0.5 (%) | Inference (ms) | Old Clothes | Beverage Cans | Cartons
YOLOv3-tiny | 38.1 | 39.9 | 8.77 | 13.0 | 34.8 | 2.1 | 0.431 | 0.426 | 0.288
YOLOv5s | 65.5 | 56.8 | 7.13 | 16.1 | 61.5 | 6.4 | 0.600 | 0.663 | 0.809
YOLOv6s | 74.3 | 61.7 | 16.31 | 44.1 | 68.8 | 6.6 | 0.622 | 0.911 | 0.880
YOLOv7-tiny | 65.4 | 51.7 | 6.12 | 13.4 | 55.6 | 6.8 | 0.613 | 0.529 | 0.887
YOLOv8s | 74.2 | 62.7 | 11.14 | 28.5 | 69.6 | 7.1 | 0.643 | 0.670 | 0.854
Enhanced YOLOv8 | 79.0 | 62.8 | 10.41 | 28.4 | 70.9 | 9.6 | 0.861 | 0.942 | 0.950
Table 6. The detection performance of other mainstream algorithms and the enhanced YOLOv8 model is compared for the validation dataset.

Models | P (%) | R (%) | Params (M) | GFLOPs | mAP@0.5 (%) | Inference (ms) | Old Clothes | Beverage Cans | Cartons
YOLOv3-tiny | 36.0 | 39.0 | 8.77 | 13.0 | 32.5 | 2.0 | 0.257 | 0.419 | 0.294
YOLOv5s | 66.8 | 53.8 | 7.13 | 16.1 | 59.6 | 6.3 | 0.750 | 0.507 | 0.864
YOLOv6s | 74.1 | 58.5 | 16.31 | 44.1 | 66.2 | 6.5 | 0.806 | 0.697 | 0.790
YOLOv7-tiny | 59.2 | 52.0 | 6.12 | 13.4 | 55.6 | 6.8 | 0.648 | 0.434 | 0.781
YOLOv8s | 72.6 | 62.4 | 11.14 | 28.5 | 68.0 | 7.2 | 0.741 | 0.655 | 0.802
Enhanced YOLOv8 | 77.4 | 62.6 | 10.41 | 28.4 | 68.5 | 9.2 | 0.924 | 0.757 | 0.829
