1. Introduction
Economic development and rising living standards inevitably generate enormous amounts of urban trash. According to one study, worldwide trash generation is projected to grow from 2.01 billion tons per year in 2016 to 3.4 billion tons per year over the following 30 years [1]. Large-scale trash pollution not only harms people's physical and mental health but also wastes considerable resources. To prevent domestic trash from seriously polluting cities, separating domestic trash is considered one of the most effective ways to control environmental pollution [2]. Faced with the problem of trash classification, many countries have formulated corresponding solutions; for example, many cities have promulgated laws requiring residents to sort domestic trash manually [3]. Even so, manual sorting remains the main means of trash sorting in most parts of the world, which increases both the health risks to workers and the cost relative to automatic machine sorting. Effective trash management and recycling of waste resources are of great importance for urban development and sustainability. With advances in mechanical automation and the combination of computer vision and artificial intelligence, automatic trash-sorting technology that is more efficient than traditional sorting methods has gradually matured. One example is the single-arm trash-sorting robot ZenRobotics Recycler, which has been deployed in a variety of complex trash-sorting scenarios for industrial application [4]. Verma et al. [5] proposed a deep learning-based intelligent system for detecting trash with UAVs, providing a low-cost, accurate, and easy-to-use solution for effective trash disposal. The combination of deep learning and image processing has greatly facilitated the application of intelligent sorting in new technologies for smart cities.
Object-detection algorithms appeared long before deep learning. The main task of object detection is to locate an object in an image and classify it, and it has become an important research direction in computer vision. Traditional object-detection algorithms are based mainly on image processing and machine learning: images are processed by extracting feature points or feature regions with algorithms such as Harris corners, HOG, and LBP [6,7,8]; a series of candidate sliding windows is then generated by a selective traversal algorithm, and, finally, the image regions inside the candidate boxes are classified with algorithms such as support vector machines (SVMs) and decision trees [9,10,11]. Compared with deep learning detectors, these traditional pipelines are relatively complicated, since image features must be extracted manually with dedicated feature-extraction algorithms. Moreover, traditional image-processing approaches struggle to extract effective features from images with complex backgrounds, so they not only fail to reach the required accuracy but also generalize poorly. With the continuous development of deep learning research, convolutional neural networks have become widely used in object detection and image classification. The convolutional layers of deep networks have a strong capacity for abstract feature extraction and can represent high-level semantic information at different depths in a hierarchical feature representation, eliminating the need for traditional hand-crafted feature extraction. With the increasing maturity of classification networks such as AlexNet [12], a series of single-stage and two-stage object-detection networks followed. The first to appear were the two-stage detectors represented by R-CNN [13]; its successors Fast R-CNN and Faster R-CNN became object-detection networks known for high precision [14,15]. The YOLO [16] detector proposed by Redmon et al. in 2016 compensates well for the slow detection speed of two-stage networks: as a one-stage detector, YOLO does not cascade a region proposer and a classifier but directly classifies and regresses on image grid regions, which makes it much faster than two-stage networks such as R-CNN. Subsequent one-stage detectors, such as the SSD series, YOLOv2, and YOLOv3, focused on continuously improving detection accuracy [17,18,19,20]. Many lightweight models, represented by the MobileNet [21] series, draw on depthwise separable convolution as a key method. In addition, Hu et al. [22] proposed the Squeeze-and-Excitation network in 2017, the first application of an attention mechanism in image processing, which improves the expressiveness of a network model while reducing parameter redundancy.
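The squeeze-and-excitation idea mentioned above can be sketched in a few lines of NumPy. This is a minimal illustration of the mechanism, not the implementation from [22]; the reduction ratio `r`, the weight scaling, and the toy tensor sizes are all illustrative.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def se_block(x, w1, w2):
    """Squeeze-and-Excitation over a feature map x of shape (C, H, W).

    w1: (C//r, C) squeeze FC weights; w2: (C, C//r) excitation FC weights.
    """
    # Squeeze: global average pooling over the spatial dims -> (C,)
    z = x.mean(axis=(1, 2))
    # Excitation: bottleneck MLP (ReLU, then sigmoid gate in (0, 1))
    s = sigmoid(w2 @ np.maximum(w1 @ z, 0.0))
    # Scale: reweight each channel map by its learned gate
    return x * s[:, None, None]

rng = np.random.default_rng(0)
c, r = 8, 2
x = rng.standard_normal((c, 6, 6))
w1 = rng.standard_normal((c // r, c)) * 0.1
w2 = rng.standard_normal((c, c // r)) * 0.1
y = se_block(x, w1, w2)
assert y.shape == x.shape  # channel reweighting preserves the shape
```

The parameter cost is only the two small fully connected layers (2·C²/r weights here), which is why the mechanism adds expressiveness at little cost.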
With the increasing computing power of GPUs and the growing richness of datasets in recent years, deep learning has become widely used in daily life, and object-detection models play a key role in many real-world scenarios. For example, Guo et al. [23] proposed the MSFT-YOLO model for industrial steel-surface images, which suffer from strong background interference, easily confused defect categories, large variations in defect scale, and poor detection of small defects. To enhance and preserve the features of small targets, Ye et al. [24] proposed the CAA-YOLO model, which combines high-resolution feature layers to better exploit shallow detail and location information. Zhao et al. [25] proposed an improved YOLOv5-based method for accurate detection of wheat spikelets in UAV images, addressing spikelet misdetection and missed detection under occlusion. Chen et al. [26] achieved real-time surface-defect detection by applying an attention mechanism along the spatial dimension of the model and adding multi-scale feature fusion to the neck of the network. These examples of recent mainstream neural-network techniques also point to the scope for upgrading various neural networks.
Trash detection involves the localization and classification of trash. Researchers have studied this application area since the beginning of the 21st century, and interest has grown with rising environmental awareness in recent years. Earlier work on trash sorting mainly classified single-object images and did not locate objects. In 2016, Yang et al. [27] achieved a test accuracy of 63% on the trash dataset TrashNet with a support vector machine (SVM) coupled with the scale-invariant feature transform (SIFT), a result that surpassed the shallow neural networks in their comparison experiments. Later, Zhang et al. [28] constructed YOLO-WASTE, a transfer learning-based multi-label trash-classification model, to achieve fast detection and classification of multiple wastes. In 2020, Ye et al. [29] applied a variational auto-encoder (VAE) to encourage the YOLO model to learn decomposable representations of complex data, improving regression accuracy for trash detection against complex backgrounds. In 2021, to automate the separation of biodegradable and non-biodegradable waste, Karthikeyan et al. [30] used the SSD algorithm together with an augmented clustering algorithm to detect trash targets automatically. Trash detection faces a number of specific challenges, such as small objects that are easily missed, complex image backgrounds, and a lack of high-quality datasets. Existing trash-detection models cannot yet overcome these challenges simultaneously, and there is still much room for improvement in detection accuracy and speed.
In this paper, we chose YOLOX, one of the state-of-the-art one-stage detection models available today, as our base model. Although YOLOX is excellent in detection accuracy and speed, there is still room for improvement when it is applied to complex domestic-trash scenes. We propose i-YOLOX to further improve the detection performance of the base model in these scenes. The main contributions of this paper are as follows:
- (1)
We proposed an improved YOLOX-based trash-detection model, i-YOLOX. First, we incorporated involution, a new-generation neural-network operator, into the network, which significantly reduced the number of parameters and improved the accuracy of the model.
- (2)
We improved the aggregation of the network's deep semantic information by adding the involution CBAM module to the feature pyramid network (FPN) structure. In addition, we improved the classification and regression capability of the decoupled head by adding residual connections and an involution block.
- (3)
We also built a self-made dataset, CDTD, containing 10,000 samples across 17 common trash categories in everyday scenes, paying particular attention to collecting and labeling multi-object trash images. In addition, we applied various data augmentation techniques, such as Mosaic augmentation, to further enrich the dataset.
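The involution operator underlying contribution (1) generates a small spatial kernel at each position from the input feature itself and shares that kernel across the channels of a group (spatial-specific, channel-agnostic). The NumPy sketch below is a deliberate simplification for illustration only: the generation weights `w_reduce`/`w_span`, the explicit loops, and the toy sizes are not our implementation of the published operator.

```python
import numpy as np

def involution(x, w_reduce, w_span, k=3, groups=1):
    """Simplified involution over x of shape (C, H, W).

    w_reduce: (C//r, C) bottleneck weights; w_span: (groups*k*k, C//r).
    At each position a (groups, k, k) kernel is generated from the local
    feature vector and shared across all channels of each group.
    """
    c, h, w = x.shape
    pad = k // 2
    xp = np.pad(x, ((0, 0), (pad, pad), (pad, pad)))
    out = np.zeros_like(x)
    cg = c // groups
    for i in range(h):
        for j in range(w):
            # Generate this position's kernels from its own feature vector
            kern = w_span @ np.maximum(w_reduce @ x[:, i, j], 0.0)
            kern = kern.reshape(groups, k, k)
            patch = xp[:, i:i + k, j:j + k]          # (C, k, k) neighborhood
            for g in range(groups):
                # One kernel per group, shared across its channels
                out[g * cg:(g + 1) * cg, i, j] = (
                    patch[g * cg:(g + 1) * cg] * kern[g]).sum(axis=(1, 2))
    return out

rng = np.random.default_rng(1)
x = rng.standard_normal((4, 5, 5))
w_reduce = rng.standard_normal((2, 4)) * 0.1
w_span = rng.standard_normal((2 * 3 * 3, 2)) * 0.1
y = involution(x, w_reduce, w_span, k=3, groups=2)
assert y.shape == x.shape
```

Note the parameter economy: in this toy setting the kernel-generating weights total 4·2 + 2·(2·9) = 44 parameters, versus 3·3·4·4 = 144 for a standard 3×3 convolution with the same channel width, which is consistent with the parameter reduction reported in contribution (1).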
The remainder of this paper is organized as follows: Section 2 introduces the original YOLOX network, the involution operator, and the CBAM attention mechanism. Section 3 describes the improvements in the i-YOLOX algorithm in detail. Section 4 presents a detailed description and analysis of the experiments. Section 5 presents extended experiments and discussion. Finally, Section 6 concludes our work.
5. Discussion
Our work follows two main research lines. The first is model lightweighting: we aim at detection algorithms that can be easily deployed in vision-sensing devices, so we built on the lightweight characteristics of YOLOX and the involution mechanism. The second is improving the generalization ability of the model: approaching the problem from trash-detection applications, we are particularly interested in algorithms that can identify many kinds of trash with large intra-class differences. The spatial-specific and channel-agnostic properties of involution also help the algorithm perform long-range and self-adaptive feature analysis. We combined these two ideas to improve algorithm performance, and the experimental results on CDTD show that i-YOLOX is indeed well suited to trash detection. To further examine the generalization performance of i-YOLOX, we conducted comparative experiments on TrashNet.
TrashNet [27] is one of the most classic public datasets in trash detection and classification; it contains six types of trash images: glass, metal, plastic, paper, cardboard, and other trash. Because many researchers use TrashNet to test network performance, we selected several state-of-the-art algorithms for comparison to demonstrate the generalization performance of i-YOLOX. Since all images in the TrashNet dataset are single-object images, we compared object-detection and classification algorithms together.
Table 6 shows the experimental results on the TrashNet dataset. All detectors and classifiers in
Table 6 were tested on TrashNet. The second column is labeled Accuracy/mAP because detectors and classifiers are evaluated with different measures: accuracy is usually used for classifiers and mean average precision (mAP) for detectors. These two metrics allow a direct comparison of the relative performance of the different CNN algorithms. We did not compare Precision, Recall, or FPS, because these are measured differently for detectors and classifiers. Since each algorithm's training is mainly influenced by the number of training epochs and the optimizer, we report these settings in the table. As the results show, our algorithm achieved comparatively good performance on TrashNet against both detectors and classifiers.
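The difference between the two metrics can be made concrete with a toy, pure-Python sketch of per-class average precision via greedy IoU matching. This is a single-class illustration without interpolation; the IoU threshold, boxes, and scores are made up for the example and do not come from Table 6.

```python
def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / (area(a) + area(b) - inter)

def average_precision(preds, gts, thr=0.5):
    """AP for one class: preds are (score, box) pairs, gts are GT boxes.

    Predictions are ranked by score and greedily matched at IoU >= thr;
    AP is the step-wise area under the precision-recall curve.
    """
    preds = sorted(preds, key=lambda p: -p[0])
    matched, tp, fp = set(), 0, 0
    precisions, recalls = [], []
    for score, box in preds:
        best, best_iou = None, thr
        for idx, g in enumerate(gts):
            if idx not in matched and iou(box, g) >= best_iou:
                best, best_iou = idx, iou(box, g)
        if best is not None:
            matched.add(best); tp += 1
        else:
            fp += 1
        precisions.append(tp / (tp + fp))
        recalls.append(tp / len(gts))
    ap, prev_r = 0.0, 0.0
    for p, r in zip(precisions, recalls):
        ap += p * (r - prev_r); prev_r = r
    return ap

gts = [(0, 0, 10, 10), (20, 20, 30, 30)]
preds = [(0.9, (0, 0, 10, 10)),     # exact hit
         (0.8, (50, 50, 60, 60)),   # false positive
         (0.6, (21, 21, 31, 31))]   # shifted hit, IoU ~0.68
assert abs(average_precision(preds, gts) - 5 / 6) < 1e-9
```

Unlike classifier accuracy, the false positive here lowers AP even though both ground-truth objects are eventually found, which is why the two columns in Table 6 are not directly interchangeable.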
Furthermore, the dataset itself played an important role in the experiments, and its effect is worth discussing. A rich dataset facilitates the construction of more robust detection models, and datasets such as TrashNet are still far from meeting the needs of domestic trash detection: the small number of images, all of them single-object images, leads to poor generalization of the trained model. In other words, larger, more thoroughly labeled, and more diverse datasets are needed. The CDTD dataset proposed in this paper helps fill the current shortage of trash-detection datasets; in real-scene detection, it enabled our algorithm to generalize and detect more objects more accurately. The detection results of i-YOLOX in a multi-object detection scenario are shown in
Figure 12, and the overall detection results were effective. Nevertheless, false detections inevitably occur in practical applications; for example, the battery in
Figure 12b was mistakenly detected as a power bank. We should therefore strengthen the collection of small-object samples in future work to make the model more capable of detecting small targets. Although the CDTD dataset contains 17 labels and 10,000 sample objects, further research is required to collect an even richer dataset of trash images.
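One inexpensive way to enrich such a dataset, mentioned among our contributions, is Mosaic-style augmentation, which stitches four training images around a random center so each composite exposes the model to more objects and scale variation per batch. The NumPy sketch below is a minimal illustration with assumed sizes; it crops rather than resizes the source images and omits the bounding-box label remapping that a real pipeline needs.

```python
import numpy as np

def mosaic(imgs, size=64, rng=None):
    """Stitch four (H, W, 3) images into one mosaic around a random center.

    Each quadrant is filled by cropping the corresponding source image;
    real pipelines also remap box labels into mosaic coordinates.
    """
    rng = rng or np.random.default_rng()
    # Random center, kept away from the borders
    cy = rng.integers(size // 4, 3 * size // 4)
    cx = rng.integers(size // 4, 3 * size // 4)
    out = np.zeros((size, size, 3), dtype=imgs[0].dtype)
    regions = [(0, cy, 0, cx), (0, cy, cx, size),
               (cy, size, 0, cx), (cy, size, cx, size)]
    for img, (y1, y2, x1, x2) in zip(imgs, regions):
        out[y1:y2, x1:x2] = img[:y2 - y1, :x2 - x1]
    return out

rng = np.random.default_rng(0)
imgs = [np.full((64, 64, 3), i, dtype=np.uint8) for i in range(4)]
m = mosaic(imgs, size=64, rng=rng)
assert m.shape == (64, 64, 3)
```

Because the center is random, the four quadrants vary in size from sample to sample, which is what diversifies object scale and context in the augmented data.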