1. Introduction
With the rapid development of artificial intelligence (AI) technology, its application across various fields such as medical diagnostics [1], autonomous driving [2], language translation [3], and image recognition [4] has become increasingly widespread, driving significant advances and innovation in these areas. AI technology can process and analyze vast amounts of data, extracting valuable information to assist in decision making and prediction, thereby enhancing work efficiency and accuracy. Particularly in the field of image recognition, the integration of deep learning and machine vision technologies has enabled AI to match and even surpass human recognition capabilities in tasks such as facial recognition and scene understanding, advancing the field of intelligent image processing.
Among its numerous applications, image-based waste classification [5] has received considerable attention in recent years. As urbanization accelerates and populations grow, the problem of urban waste becomes increasingly severe, making the effective classification and recycling of waste an urgent issue. Real-time, efficient classification of municipal waste can enhance recycling efforts and strengthen waste management, while also ensuring the cleanliness of urban environments and protecting public health. Traditional methods of municipal waste classification rely primarily on manual sorting, which is labor-intensive and prone to errors [6]. Techniques such as magnetic separation and eddy current separation are commonly used for segregating metal components, but these do not address the sorting of nonmetallic components [7]. Air classification and screening have been employed to separate waste by size and weight, yet these methods lack the precision needed to distinguish materials that are visually similar but must be recycled separately [8]. Utilizing AI technology for the automatic identification and classification of waste images not only improves the efficiency and accuracy of waste sorting but also reduces labor costs [9]. Moreover, it helps increase the proportion of waste that is recycled, playing a significant role in environmental protection and resource recovery. For instance, Malik et al. [10] discussed an AI framework that incorporates intelligent recognition and management strategies to improve municipal solid waste image classification. Wang et al. [11] used MobileNetV3 and IoT technology to achieve high-precision identification of garbage, including plastic, paper, and more. Deep learning models that identify various types of waste enable rapid and precise classification, guiding the recycling and processing of waste and providing technical support for urban management and environmental protection.
However, in advancing image-based waste classification efforts, we encounter several significant challenges.
Firstly, intelligent recognition of municipal waste is not yet mature, the main challenge being the diversity and complexity of waste images [12]. Municipal waste encompasses many types of refuse, whose shapes, sizes, colors, and configurations can vary widely across images. Moreover, waste often appears against complex backgrounds, making it harder for intelligent recognition systems to identify and classify it. Improving recognition accuracy is crucial for optimizing resource recycling, reducing landfill volumes, and protecting the environment. Developing efficient intelligent recognition technologies to address these challenges is therefore particularly important.
Secondly, there is a lack of effective multi-label recognition methods. In practice, an image often contains multiple types of waste, requiring the system to identify all waste types in an image simultaneously. However, traditional image recognition methods mostly focus on single-label recognition and fall short in dealing with complex scenarios that include multiple categories of waste, failing to meet the practical application demands.
Finally, faced with the task of processing a large volume of municipal waste classification, reducing computational complexity while maintaining high accuracy to achieve real-time processing poses another challenge. With the increasing amount of urban waste, the demand for processing speed also rises. Ensuring the system can rapidly and accurately process large volumes of data, given limited resources, is crucial for the efficient and automated realization of waste classification tasks.
Our work introduces several key contributions to the domain of municipal waste management through image recognition:
Development of a flexible multi-label image classification framework: We present the Query2Label (Q2L) framework, tailored for the complex task of municipal waste image recognition. This model excels in identifying multiple types of waste within the same image, utilizing self-attention and cross-attention mechanisms to accurately classify waste types, enhancing both accuracy and efficiency.
Utilization of a novel municipal waste dataset: Our study employs the “Garbage In, Garbage Out” (GIGO) dataset, a newly developed collection of urban waste images. This dataset, with its diversity and real-world scenarios, significantly aids in improving the model’s performance by providing a wide array of waste images for training and testing.
High accuracy with low computational complexity: Compared to existing models, our approach achieves superior precision in identifying various types of waste while maintaining computational efficiency. This ensures the model’s suitability for real-time applications, highlighting its potential for practical deployment in waste management systems.
The structure of our paper is outlined as follows: Section 2 reviews related work in multi-label image classification and intelligent waste identification. Section 3 introduces our novel Query2Label framework and the Vision Transformer backbone of our intelligent waste recognition model. Section 4 describes the GIGO dataset, our experimental setup, and evaluation metrics. Section 5 presents the results of our experiments, demonstrating the effectiveness of our model through comparisons and ablation studies. Section 6 concludes with reflections on our findings and directions for future research.
3. Method
3.1. Q2L Framework for Intelligent Multi-Label Waste Image Recognition
In this study, we propose a novel framework, Q2L [30], aimed at addressing the challenges associated with intelligent multi-label recognition of urban waste images. The framework is designed to overcome the limitations of traditional single-label classification methods, especially in complex scenarios involving images with multiple types of waste. Utilizing self-attention and cross-attention mechanisms, Q2L effectively models the intricate relationships among waste types as well as the interactions between waste images and labels, as shown in Figure 1.
Initially, the Q2L framework accepts an input waste image $x \in \mathbb{R}^{H \times W \times 3}$, where $H$ and $W$ represent the height and width of the image, and 3 stands for the RGB channels. A feature extraction network, which can be either a convolutional neural network or a transformer-based network, processes the image to produce a feature map $F \in \mathbb{R}^{h \times w \times d}$, with $h$, $w$, and $d$ denoting the height, width, and dimension of the feature map, respectively.
Following feature extraction, the core of the Q2L framework is the Transformer decoder, which models the extracted features and label embeddings. Through the self-attention mechanism, the model captures co-occurrence relationships among waste types, while the cross-attention mechanism aligns visual patterns with corresponding labels. Specifically, self-attention (Self-Attn) and cross-attention (Cross-Attn) both instantiate scaled dot-product attention, formulated as follows:

$$\mathrm{Attn}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V,$$

where $Q$, $K$, and $V$ represent the query, key, and value matrices, respectively, and $d_k$ is the dimension of the key matrix. In self-attention, $Q$, $K$, and $V$ are all derived from the label embeddings, whereas in cross-attention the queries come from the label embeddings and the keys and values come from the image features. These mechanisms allow Q2L to comprehensively process the dependencies and interactions between various waste types, thus enhancing recognition accuracy and efficiency.
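As a concrete illustration, the scaled dot-product attention above, and the way Q2L applies it first among label embeddings (self-attention) and then between label queries and image features (cross-attention), can be sketched in NumPy. The shapes and dimensions here are illustrative toy values, not the configuration used in our experiments:

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    return softmax(scores, axis=-1) @ V

rng = np.random.default_rng(0)
num_labels, num_patches, d = 4, 16, 8   # toy sizes

# Self-attention: label embeddings attend to each other,
# capturing co-occurrence relationships among waste types.
labels = rng.standard_normal((num_labels, d))
self_out = attention(labels, labels, labels)       # Q = K = V = label embeddings

# Cross-attention: label queries attend to image feature tokens,
# aligning each label with the visual patterns that support it.
features = rng.standard_normal((num_patches, d))
cross_out = attention(self_out, features, features)  # Q from labels, K/V from image
```

Each label embedding ends up as a weighted mixture of image features, so the per-label classifier that follows sees evidence gathered specifically for that label.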
3.2. Backbone
To enhance the intelligent multi-label classification of urban waste images, we incorporate the ViT [4] as the backbone within our Q2L framework, distinctively suited for parsing the complexities of waste categorization. ViT's architecture, distinct from conventional convolutional approaches, offers a nuanced understanding of the spatial hierarchies and inter-patch relationships critical for identifying various waste components in an image, as shown in Figure 2.
The preprocessing stage involves normalizing and resizing input waste images to a uniform dimension. Each image is then partitioned into fixed-size pixel patches (16 × 16 for the ViT-B/16 backbone used here), akin to words in a sentence for natural language processing (NLP) tasks. These patches are linearly projected into a D-dimensional embedding space, creating a sequence of patch embeddings. To retain positional context, necessary for discerning spatial arrangements of waste types, positional embeddings are added to the patch embeddings.
The core analytical process employs a transformer encoder that operates on the patch embeddings, augmented with positional information. This encoder, through self-attention mechanisms, enables the model to focus on relevant segments of the waste image for classification. It effectively captures global dependencies across patches, facilitating a comprehensive understanding of the image context. This is vital for accurately identifying and classifying multiple waste items present within a single image.
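The patch-embedding pipeline described above can be sketched as follows. The 224 × 224 input resolution and the random projection and positional weights are illustrative assumptions; only the 16 × 16 patch size and the 768-dimensional embedding follow from the ViT-B/16 configuration:

```python
import numpy as np

def patchify(image, patch_size):
    """Split an H x W x C image into non-overlapping, flattened patches."""
    H, W, C = image.shape
    P = patch_size
    grid = image.reshape(H // P, P, W // P, P, C).transpose(0, 2, 1, 3, 4)
    return grid.reshape(-1, P * P * C)          # (num_patches, P*P*C)

rng = np.random.default_rng(0)
image = rng.random((224, 224, 3))               # assumed input resolution
P, D = 16, 768                                  # ViT-B/16: 16x16 patches, D = 768

tokens = patchify(image, P)                     # (196, 768): a 14 x 14 patch grid
W_proj = rng.normal(scale=0.02, size=(P * P * 3, D))   # learned in practice
pos = rng.normal(scale=0.02, size=(tokens.shape[0], D))  # learned in practice
embeddings = tokens @ W_proj + pos              # linear projection + positional embedding
```

The resulting sequence of 196 position-aware tokens is what the transformer encoder consumes; without the positional term, the encoder would be unable to distinguish spatial arrangements of waste items.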
3.3. Asymmetric Loss Function
In addressing the complex challenge of multi-label waste image classification, it is essential to adopt a loss function that can effectively manage the intricacies of this task, including class imbalance and the presence of hard-to-classify instances. Traditional loss functions like binary cross-entropy (BCE) [31] offer a foundational approach by evaluating prediction accuracy across multiple labels. However, BCE may not adequately emphasize the more challenging or less frequent waste categories, leading to suboptimal classification performance.
To navigate these challenges, we introduce an adapted asymmetric loss (ASL) function [32], tailored to the unique requirements of waste image classification. The ASL function mitigates the limitations of conventional loss functions by applying distinct focusing parameters to positive and negative predictions, thereby enhancing the model's sensitivity to rare and difficult-to-detect waste categories.
The ASL function for a given waste image classification task is formulated as follows:

$$\mathcal{L}_{\mathrm{ASL}} = -\frac{1}{M} \sum_{i=1}^{M} \left[ y_i \left(1 - p_i\right)^{\gamma_+} \log\left(p_i\right) + \left(1 - y_i\right) p_{i,m}^{\gamma_-} \log\left(1 - p_{i,m}\right) \right], \qquad p_{i,m} = \max\left(p_i - m,\, 0\right),$$

where $M$ represents the total number of waste categories, $y_i$ denotes the ground truth label for category $i$, $p_i$ is the predicted probability for category $i$, $\gamma_+$ and $\gamma_-$ are the focusing parameters for positive and negative samples, respectively, and $m$ is a margin that shifts the probabilities of negative samples, effectively reducing the influence of highly confident negative predictions on the loss calculation.
The introduction of a margin $m$ in the ASL function serves to further refine the focus on challenging negative samples, ensuring that the model does not become complacent with easily classified negatives. By dynamically adjusting the influence of positive and negative samples through $\gamma_+$ and $\gamma_-$, the ASL function allows for a more nuanced training process. This process encourages the model to prioritize learning from misclassified or rare waste types, which are often overlooked by more conventional approaches.
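A minimal NumPy sketch of this asymmetric loss follows. The concrete values of the focusing parameters and margin are illustrative assumptions, not the settings used in our experiments, and `eps` is a small constant added only for numerical stability:

```python
import numpy as np

def asl_loss(p, y, gamma_pos=1.0, gamma_neg=4.0, m=0.05, eps=1e-8):
    """Asymmetric loss over M predicted probabilities p and binary targets y.

    Negatives are margin-shifted (p_m = max(p - m, 0)) so that highly
    confident negatives contribute (almost) nothing to the loss, while
    gamma_neg > gamma_pos down-weights easy negatives more aggressively.
    """
    p_m = np.clip(p - m, 0.0, 1.0)
    loss_pos = y * (1.0 - p) ** gamma_pos * np.log(p + eps)
    loss_neg = (1.0 - y) * p_m ** gamma_neg * np.log(1.0 - p_m + eps)
    return float(-(loss_pos + loss_neg).mean())

y = np.array([1.0, 0.0, 1.0, 0.0])    # ground truth for M = 4 waste categories
p = np.array([0.90, 0.40, 0.55, 0.02])
loss = asl_loss(p, y)
```

Note that the last negative (p = 0.02) falls below the margin, so it is fully suppressed; the hesitant negative (p = 0.40) and the under-confident positive (p = 0.55) dominate the loss, which is exactly the asymmetric emphasis described above.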
6. Conclusions and Future Work
This study explores how AI can be effectively applied to image recognition for municipal waste management. With cities growing rapidly, efficient waste sorting is more important than ever, and AI can improve sorting accuracy and efficiency, thereby boosting recycling efforts. The paper introduces a novel framework named Query2Label, combined with ViT-B/16 as the backbone and an asymmetric loss function, to tackle the inherent complexities of multi-label waste image classification. Through meticulous experimentation on the “Garbage In, Garbage Out” dataset, it demonstrates the framework’s superiority in recognizing diverse waste types against varying backdrops, achieving remarkable precision and recall over conventional methods such as YOLOv7 and ResNet-101.
Despite its advancements, the study identifies room for improvement in areas such as handling the vast diversity within municipal waste categories and further reducing computational demands to enable real-time processing. The current model, while efficient, might struggle with highly cluttered scenes or rare waste items not adequately represented in the training dataset.
For future work, we envisage several key areas of development to further enhance the capabilities of our image recognition framework for municipal waste management:
Dataset expansion and diversification: To enhance the model’s generalization capabilities across a broader spectrum of waste types and scenarios, it is imperative to expand and diversify the training dataset. This expansion could include a variety of waste materials and configurations, as well as a more extensive range of environmental conditions. Additionally, incorporating data from multiple cities can mitigate the influence of specific urban aesthetics and municipal characteristics, which will further enhance the model’s adaptability and performance across diverse urban settings.
Integration of multiple sensory inputs: Incorporating data from additional modalities, such as infrared imaging, depth sensing, and perhaps even acoustic sensors, could significantly enhance the model’s ability to distinguish between different types of waste in visually complex scenes. This multi-modal approach might reveal characteristics of materials that are not apparent in visual-spectrum photographs alone.
Development of lightweight models: Investigating and developing more efficient model architectures that maintain high accuracy while being computationally less demanding is essential. This could facilitate the deployment of advanced waste classification systems on mobile or embedded devices, enabling real-time processing and decision making at the point of waste collection or sorting.
In summary, the future enhancements of our municipal waste management image recognition framework focus on expanding the training dataset, integrating multiple sensory inputs, and developing lightweight models. These initiatives aim to improve waste type generalization, enhance recognition in complex scenes, and enable real-time decision making, setting a foundation for more accurate and efficient waste classification systems.