Article

A Garbage Detection and Classification Model for Orchards Based on Lightweight YOLOv7

1 Macau Institute of Systems Engineering, Macau University of Science and Technology, Taipa 999078, China
2 School of Mechanical and Electrical Engineering, Lingnan Normal University, Zhanjiang 524000, China
* Author to whom correspondence should be addressed.
Sustainability 2025, 17(9), 3922; https://doi.org/10.3390/su17093922
Submission received: 6 March 2025 / Revised: 21 April 2025 / Accepted: 23 April 2025 / Published: 27 April 2025

Abstract

The disposal of orchard garbage (including pruning branches, fallen leaves, and non-biodegradable materials such as pesticide containers and plastic film) poses major difficulties for horticultural production and soil sustainability. Unlike general agricultural garbage, orchard garbage often contains both biodegradable organic matter and hazardous pollutants, which complicates efficient recycling. Traditional manual sorting methods are labour-intensive and inefficient in large-scale operations. To this end, we propose a lightweight YOLOv7-based detection model tailored to the orchard environment. By replacing the CSPDarknet53 backbone with MobileNetV3 and GhostNet, the model achieves a mean average precision (mAP) of 84.4% while requiring only 16% of the original model's computational load. Meanwhile, a supervised contrastive learning strategy further strengthens feature discrimination between horticulturally relevant categories, distinguishing compostable pruning residues from toxic materials. Experiments on a dataset containing 16 orchard-specific garbage types (e.g., pineapple shells, plastic mulch, and fertiliser bags) show that the model has high classification accuracy, especially for materials commonly found in tropical orchards. The lightweight nature of the algorithm allows for real-time deployment on edge devices such as drones or robotic platforms, and future integration with robotic arms for automated collection and sorting. By converting garbage into a compostable resource and separating contaminants, the technology aligns with China's garbage segregation initiatives and global sustainability goals, providing a scalable pathway to reconcile ecological preservation with horticultural efficiency.

1. Introduction

The world population is expected to reach 9 billion by 2050 and 11 billion by 2100 [1]. To meet the demands of this growing population, crop production will increase dramatically, which in turn contributes to the generation of agricultural wastes (AWs) [2]. The main categories of AWs that are of public concern and threaten the sustainability of agricultural systems are crop residues (leaf litter, pods, stalks, and hulls), poultry wastes (spilled feed, feathers, and faeces), agro-industrial wastes (bagasse, molasses, fruit peels, coconut peels), fruit pulp (oranges, apples, mangoes, etc.), and aquaculture wastes [3]. Garbage segregation policies adopted worldwide have proven to be a key governance tool for enhancing the efficiency of resource transformation and optimising the quality of the human environment. Standardised garbage disposal mechanisms not only significantly improve resource recycling efficiency but also affect the quality of life of the public and serve as an important indicator of the level of civilisation in society [4,5,6]. Current approaches to managing garbage in orchards remain heavily dependent on manual effort: workers typically gather debris such as pruned branches, fallen leaves, and discarded pesticide containers only after harvest seasons. This centralised, post-event cleanup fails to tackle the constant stream of mixed garbage generated daily, from biodegradable organic matter to persistent plastics such as mulch films. Such garbage is either burned openly, releasing toxic fumes, or buried haphazardly, where pesticide residues seep into groundwater and microplastics infiltrate soils. These practices not only squander valuable resources but also perpetuate environmental degradation, threatening the very ecosystems that sustain horticultural productivity. The implementation of orchard garbage segregation techniques has the potential to improve the efficiency, reuse, and environmentally sound treatment of garbage, thereby promoting the resourceful use of orchard garbage [7].
Global orchard garbage generation reaches 280–350 million tonnes annually, a scale equivalent to 1.8 times global municipal garden garbage. However, owing to bottlenecks in treatment capacity and degradation technology, the current resource utilisation rate is less than 28%, which is a key constraint on improving the effectiveness of agricultural recycling systems. At present, orchard garbage disposal relies mainly on manual sorting, which is inefficient and unable to keep pace with the large amount of garbage produced by orchards every day. In addition, the working environment for garbage sorting is poor, which harms the physical and mental health of workers [8,9]. Orchard garbage consists mainly of pruned branches, rusty tools, and glass, which are hard, slow to degrade, and usually receive no dedicated treatment. Lawn garbage, such as crushed grass clippings, is soft and perishable and may be suitable for composting and reuse. Roadside garbage, such as greyish leaves, is often mixed with vehicle exhaust pollutants and plastic waste and requires special disinfection. Therefore, exploring methods for garbage classification and detection is an important foundation for the effective utilisation of orchard garbage resources and a key support for sound garbage management. The specific objective of our core technical work is to design a lightweight garbage sorting model that not only meets stringent deployment prerequisites but also ensures optimal performance. To address the challenges of garbage detection and sorting in orchards, we propose a new approach that combines the YOLO algorithm, a lightweight backbone network, a semantic feature-wise relation network, and supervised contrastive learning. This is particularly important in orchard environments, where garbage is diverse and relationships between objects can provide valuable information for accurate classification.
In addition, the model is designed to work perfectly with robotic arm operations, thus facilitating a fully automated, unmanned and intelligent garbage sorting system. The ultimate goal is to deploy such a system in orchard environments in the foreseeable future, transforming garbage management practices in this area.
The remainder of this paper is organised as follows: Section 2 reviews the related works and dataset selection. Section 3 explains the methodology, including the proposed lightweight YOLOv7 model. Section 4 presents the experimental results and evaluations. Finally, Section 5 concludes the paper and discusses future work.

2. Related Work

This section examines the available datasets and reviews existing research on approaches to garbage classification, both of which inform the work presented in this paper.

2.1. Dataset

In the context of visual recognition based on deep learning, the primary training dataset currently employed is a publicly accessible garbage classification dataset (TrashNet) comprising 2527 images in six categories, constructed by Mindy Yang and colleagues at Stanford University [10]. The GINI dataset encompasses a total of 2561 images portraying garbage materials [11]. Of these, 956 were sourced from the internet via a search engine using garbage-related keywords such as "roadside garbage" and "market garbage", and each image is labelled with its level of severity and biodegradability. In 2020, Shenzhen, China, hosted the Huawei Cloud Garbage Sorting Challenge Cup, a data application innovation competition, and released a set of household garbage image data [12]. TACO is a dataset for the classification and detection of garbage, comprising 1500 images and 4784 annotations [13]. Despite its limited size, it can be used for garbage classification and edge detection. However, the lack of annotations in the TrashNet dataset and the unreliability of those in the TACO dataset prompted the release of the AquaTrash dataset, which contains 369 images across four categories and has been manually annotated to ensure accurate training results [14].
We found that none of the existing datasets is specific to orchard garbage, although these generic garbage classification image datasets may contain related categories such as 'husks', 'peels', 'branches', and 'tree branches'. We explain in detail how the training dataset was expanded in Section 4.

2.2. Garbage Classification Model

With the rapid development of deep learning, effective garbage classification using deep learning has been widely studied by scholars at home and abroad. These methods can be grouped into ResNet-based methods, DenseNet-based methods, and methods combining transfer learning with convolutional neural networks:
  • ResNet-based methods. An intelligent garbage classification system based on ResNet50 and a Support Vector Machine (SVM) was proposed by Adedeji et al. [15]. It employed ResNet50 for feature extraction and an SVM to categorise the extracted features, achieving an accuracy of 87% on the Trash dataset. Yaqing G. designed a lightweight garbage classification model, GA_MobileNet, based on ResNet50 [16]. It reduces the computational effort and parameter count by using depthwise and grouped convolutions, and improves accuracy through a channel attention mechanism. The proposed methodology addresses the issue of garbage classification on embedded devices.
  • DenseNet-based methods. Susanth G. S. et al. evaluated VGG16, AlexNet, ResNet50, and DenseNet169 on the TrashNet dataset and found that DenseNet169 performs best, achieving 94.9% detection accuracy [17]. Mao W. L. et al. used a genetic algorithm to optimise the hyperparameters of the DenseNet121 fully connected layer to improve accuracy [18]. Experiments demonstrated that using two fully connected layers as the classifier for DenseNet121 performs better on the garbage classification task than the original DenseNet121 equipped with global average pooling and a softmax classifier.
  • Methods based on the combination of transfer learning and convolutional neural networks. Feng J. et al. proposed a method for garbage image classification based on transfer learning and Inception-v3 [19,20], which retains the excellent feature extraction capability of the Inception-v3 model while achieving high recognition accuracy when image data are insufficient. Cao L. used transfer learning, based on the Inception-v3 model, to train a model specialised in recognising garbage categories, improving the recognition rate through algorithmic research and model modification [21]. Yang et al. developed a novel incremental learning framework called GarbageNet to address the lack of sufficient garbage image data, the high cost of category incrementation, and noisy labels [22]. Their approach uses an incremental learning method so that the model continuously learns and updates from new samples, while eliminating the effect of noisy labels through AFM (Attentive Feature Mixup). Chen Yu et al. proposed a garbage classification method based on an improved YOLO algorithm, which uses CSPDarknet-53 as the backbone feature extraction network, effectively reducing inference computation while preserving model accuracy. Meanwhile, by adding several new spatial pyramid pooling (SPP) modules, a better fusion of global and local features is achieved [23].
We found very few models designed specifically for orchard garbage classification. In this study, we draw on the training methods of these prior works to optimise the recognition of orchard garbage.

3. Methodology

This section describes the algorithm selection and the key techniques used. The approach in this paper starts from lightweighting as an objective, and different backbone networks are tested as replacements. We ensure the accuracy and efficiency of recognition by replacing the original backbone with a combination of two lightweight backbone networks, and we innovatively use a Feature Fusion Module to make the model more applicable to real-world scenarios.

3.1. YOLO Algorithm

In 2016, Redmon et al. proposed the YOLO algorithm [24], which uses a single neural network for image classification and target detection and is therefore widely used in the field of target detection. The YOLOv7 model is an enhanced iteration of the YOLO family, one of the most efficient target detectors, and lays the groundwork for more complex target detection operations [25]. The CSPDarknet53 network serves as the backbone of the YOLOv7 model and extracts preliminary image feature information. The NECK module augments the image features and consists of SPP (Spatial Pyramid Pooling) and PANet components. The YOLOHead module transforms the extracted features into the desired prediction results. In this study, YOLOv7 is used as the base model for experimental analysis, and a lightweight strategy is proposed to improve its performance in garbage classification. The technique addresses the challenges of practical deployment of the model, which is necessary for real-world applications, because in real operational environments the model usually needs to be deployed on edge or embedded devices [26].
To address this problem, we propose a lightweight YOLOv7 model based on a relation-network feature fusion module, with the aim of achieving efficient garbage classification and detection. Specifically, we use MobileNetV3 [27] and GhostNet [28] instead of CSPDarknet53 as the backbone networks of the lightweight YOLOv7 model. In addition, we introduce a relation network module to integrate the multi-channel features generated by MobileNetV3 and GhostNet. Furthermore, to address the problem of degraded image representation in YOLOv7, we fine-tune the two pre-trained backbones in the network with a supervised contrastive learning objective function. The model is trained on a supervised dataset, which allows it to learn effective image representations. While the standard YOLOv7 model is too bulky to be practical, the lightweight YOLOv7 model studied in this paper has clear advantages for practical applications. In addition, it is capable of real-time recognition on embedded and edge devices, addressing the limited memory capacity of such devices and the incompatibility of the standard model with them. The improved YOLOv7 network architecture is shown in Figure 1, and each technical component is described in detail in later sections.

3.2. Lightweight Backbone Network

Deploying existing garbage classification models in embedded devices is a major challenge due to the large size of these models. Therefore, we employ a lightweight neural network in YOLOv7 instead of the CSPDarknet53 network to obtain preliminary image feature information. It is worth noting that CSPDarknet53 is responsible for most of the image feature extraction, and it had a considerable impact on the design of YOLOv7. However, with the development of deep learning, we found that while networks consisting of a large number of layers may yield more sophisticated feature information extraction, they also require higher training costs. When the number of layers in a stacked network exceeds a specific threshold, the training process becomes less efficient. Therefore, there is an increasing tendency to utilise lightweight networks instead of complex, computationally intensive neural networks, while ensuring the ability to extract image features efficiently. Based on this, an experimental investigation was conducted to assess the potential of lightweight neural networks, including MobileNetV1 [29], MobileNetV2 [30], MobileNetV3 [27], and GhostNet [28], as replacements for CSPDarknet53. These networks were evaluated for their ability to achieve the desired lightweight characteristics by reducing the number of parameters in the backbone structure through the use of continuous feature layers obtained during the convolution process.
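To make the backbone swap concrete, the sketch below shows one way of pulling three multi-scale feature maps out of a torchvision MobileNetV3 so that they can stand in for the backbone outputs consumed by the YOLO neck. The specific node names and input size are illustrative assumptions rather than the exact layers used in this paper, and the GhostNet branch would be handled analogously.

```python
# Minimal sketch: extracting three effective feature layers from MobileNetV3
# to replace the CSPDarknet53 backbone outputs. The node names chosen here
# ('features.6', 'features.12', 'features.16') are illustrative assumptions.
import torch
from torchvision.models import mobilenet_v3_large
from torchvision.models.feature_extraction import create_feature_extractor

backbone = mobilenet_v3_large(weights=None)
extractor = create_feature_extractor(
    backbone,
    return_nodes={"features.6": "p3", "features.12": "p4", "features.16": "p5"},
)

x = torch.randn(1, 3, 640, 640)   # one 640x640 RGB image
feats = extractor(x)              # dict of three multi-scale feature maps
for name, f in feats.items():
    print(name, tuple(f.shape))
```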

3.3. Semantic Feature-Wise Relation Network

An enhanced feature extraction network based on relational networks can fuse the features of the three effective feature layers output by the lightweight neural networks described above, thereby driving the model to extract more accurate image features. The relational network (RN) is particularly well suited to target detection tasks because it is designed to reason about relations between entities [31]. In other words, it deduces connections among feature representations in a way that makes better use of the data. In other visual tasks, RNs have been shown to learn efficient generalisations without needing separate weights for every object in the data. In this section, we extend the RN model by employing a data fusion function to handle relationships across binary groups of image feature vectors.
In this study, we propose a conjecture regarding the interplay between two networks, namely MobileNetV3 and GhostNet. To test this conjecture, we selected any two of the three effective feature layers in their outputs and combined them as a group, performing feature fusion on them. This operation allows the two networks to complement each other’s strengths and extract additional feature information, thereby improving accuracy. The RN can be represented as a composite function in its simplest form, as follows:
$$RN(O) = f_\phi\left(\sum_{i,j} g_\theta(o_i, o_j)\right)$$
where $O$ is a collection of input objects $\{o_1, o_2, \ldots, o_n\}$, $o_i \in \mathbb{R}^m$, and $f_\phi$ and $g_\theta$ are functions with trainable parameters. The internal function $g_\theta$ learns a relation on tuples, thus providing an abstract representation for the external classifier $f_\phi$.
The three feature layers extracted by MobileNetV3 and GhostNet are converted into relation vectors as follows: given an input image $x$, the outputs $a$ and $b$ are produced by passing $x$ through the MobileNetV3 and GhostNet models, respectively. We obtain the encoding vector $[a_i : b_i]$, $i \in \{1, 2, 3\}$, by concatenating along the feature dimension, where $a_i$ is one of the three feature layers extracted from the input sample image $x$ by MobileNetV3, and similarly $b_i$ is one of the three feature layers extracted from $x$ by GhostNet.
$$l_i = g_\theta([a_i : b_i])$$
Corresponding to the above equation, the relation vector $l_i$ is inferred by $g_\theta$ using the concatenated vector $[a_i : b_i]$ as input in the encoding step. $g_\theta$ is an MLP with learnable parameters $\theta$, and it produces the relation vector $l_i \in L$, $i \in \{1, \ldots, n\}$.

FFM (Feature Fusion Module)

A relational network-based feature fusion layer is employed in the model, facilitating its deployment in real-world operational contexts. This addresses the issue of preserving the model's accuracy without compromising the lightweight nature of the garbage detection model. In detail, we first employ an MLP $\rho$ with parameters $\theta$ to obtain the relation vector $l_i$ by stacking the feature layers together along the depth (channel) dimension of the image. We denote the feature layers output by backbone 1 as $M_{11}$, $M_{12}$ and $M_{13}$, and the features output by backbone 2 as $M_{21}$, $M_{22}$ and $M_{23}$, each with a channel number of 3. The features $M_{1i}$ and $M_{2i}$ are stacked by concatenation as follows:
$$l_i = \rho_\theta([M_{1i}, M_{2i}]), \quad i \in \{1, 2, 3\}$$
where the relation vector l i can be represented as the relation feature of layer i in the image sample extracted by the model.
Following the acquisition of relation vectors, our strategy was compared with that of Santoro et al., who employed an RN to aggregate the feature outputs of the model and then fed a single merged relation vector to the classifier [31]. We determined that such a single merged relation vector, in which all relations are treated as one unit, is not optimal: it does not compel the model to comprehend the finer details of the relational structure inherent in the dataset. The MobileNetV3 and GhostNet models both collect information about the dataset from three distinct feature layers, wherein the relationships are ternary. In light of these considerations, we developed a novel relational fusion unit, termed Semantic Feature-wise Transformation (SFT), inspired by the work of Perez et al. [32]. SFT introduces a feature-wise transformation layer that enables the model to apply distinct weights when integrating the three learned relation vectors $l_i$. In essence, the model consolidates the relation vectors acquired through MobileNetV3 and GhostNet into a unified relation vector, as illustrated in Figure 2.
The SFT can be expressed as follows:
$$SFT(n, C, L) = \sum_{i=1}^{n} \left( \alpha(c_i) \cdot l_i + \beta(c_i) \right)$$
where $C$ is the set of $n$ binary groups $[a, b]$, $L$ is the set of $n$ relation vectors from Equation (1), and $\alpha$ and $\beta$ are MLPs. The output of SFT is a fused relation vector that represents all the relational information present in the sample images as extracted by MobileNetV3 and GhostNet. Figure 3 provides a comprehensive illustration of this structure.
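As a concrete illustration of the FFM described above, the sketch below concatenates paired feature maps from the two backbones, applies a per-level MLP $\rho_\theta$, and fuses the three relation vectors with learned scale and shift terms in the spirit of SFT/FiLM. The channel widths, embedding dimension, and the global-average-pooling step are assumptions made for the example, not the exact configuration used in the paper.

```python
# Sketch of the relation-network feature fusion (FFM) with an SFT-style merge,
# assuming global average pooling turns feature maps into vectors.
# Channel widths are placeholders, not the values used in the paper.
import torch
import torch.nn as nn

class FFMWithSFT(nn.Module):
    def __init__(self, channels=(40, 112, 160), dim=128):
        super().__init__()
        # rho_theta: one small MLP per feature level, applied to the
        # concatenation [M_1i, M_2i] of the two backbones' pooled features.
        self.rho = nn.ModuleList(
            nn.Sequential(nn.Linear(2 * c, dim), nn.ReLU(), nn.Linear(dim, dim))
            for c in channels
        )
        # alpha, beta: MLPs producing per-level scale and shift (FiLM-style).
        self.alpha = nn.ModuleList(nn.Linear(2 * c, dim) for c in channels)
        self.beta = nn.ModuleList(nn.Linear(2 * c, dim) for c in channels)
        self.pool = nn.AdaptiveAvgPool2d(1)

    def forward(self, feats_m, feats_g):
        fused = 0
        for i, (m, g) in enumerate(zip(feats_m, feats_g)):
            c = torch.cat([self.pool(m).flatten(1), self.pool(g).flatten(1)], dim=1)
            l_i = self.rho[i](c)                              # relation vector l_i
            fused = fused + self.alpha[i](c) * l_i + self.beta[i](c)
        return fused                                          # fused relation vector

# Usage with dummy multi-scale features from the two backbones
m_feats = [torch.randn(2, 40, 80, 80), torch.randn(2, 112, 40, 40), torch.randn(2, 160, 20, 20)]
g_feats = [torch.randn(2, 40, 80, 80), torch.randn(2, 112, 40, 40), torch.randn(2, 160, 20, 20)]
print(FFMWithSFT()(m_feats, g_feats).shape)                   # torch.Size([2, 128])
```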

3.4. Contrastive Learning

Recent research by Ethayarajh indicates that the representations acquired by pre-trained models exhibit anisotropy [33]: the learned embeddings occupy a narrow cone in the vector space, which significantly constrains their representational power. To address this issue, we draw upon the research ideas of Becker and Hinton, which aim to ensure the consistency of image representations in the face of small transformations [34]. We extend this by making use of the latest research on network structure, data augmentation and contrastive loss. In light of recent developments in contrastive learning algorithms, we implement a contrastive learning loss function that guides the model towards consistent feature representations across different instances within the same dataset through a contrastive loss in the hidden space. This approach optimises representation learning.
As shown in Figure 4, the self-supervised contrastive loss (left panel) contrasts a single positive image for each anchor (an augmented version of the same image) against a set of negatives consisting of the remainder of the batch. The supervised contrastive loss (right panel), on the other hand, contrasts the set of all samples of the same category as positives against the negatives from the rest of the batch. As the photographs of different species of tigers show, taking category label information into account brings elements of the same category closer together in the embedding space than in the self-supervised case.

3.4.1. Self-Supervised Contrastive Learning

The self-supervised contrastive learning process involves randomly drawing mini-batches of N examples and defining the contrastive prediction task on pairs of augmented instances derived from them. An augmented pair consists of the original instance together with a neighbouring sample obtained by applying a data augmentation strategy to it, which results in 2N data instances. Rather than explicitly sampling negatives, the remaining 2(N−1) augmented samples in the mini-batch (i.e., all samples other than the positive pair) are treated as negative samples, following the approach proposed by Chen et al. [35]. Assuming that $\mathrm{sim}(u, v) = u^{\top} v / (\|u\|\,\|v\|)$ denotes the dot product between the normalised vectors $u$ and $v$, the loss function for a pair of positive samples $(i, j)$ is defined as follows:
$$\ell_{i,j} = -\log \frac{\exp(\mathrm{sim}(h_i, h_j)/\tau)}{\sum_{k=1, k \neq i}^{2N} \exp(\mathrm{sim}(h_i, h_k)/\tau)}$$
where k is a sample other than the original sample i in a mini-batch and τ is a temperature parameter. The computation of the final loss encompasses all positive sample pairings within a mini-batch, including both ( i ,   j ) and ( j ,   i ) . Within this study, we employ a pre-trained YOLOv7 model to encode the input images and then fine-tune all parameters with a contrastive learning objective function. It is worth noting that an important issue in contrastive learning is how to construct sample pairs ( x i , x i + ) . According to Dosovitskiy et al. [36], an effective solution in visual representation is to take the same images and obtain x i and x i + by operations of simple transformations (e.g., cropping, flipping, morphing, and rotating).
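A minimal NT-Xent-style implementation of this self-supervised loss is sketched below; it assumes the 2N embeddings are stacked so that rows i and i+N are the two augmented views of the same image, which is a layout choice made for the example.

```python
# Sketch of the self-supervised contrastive (NT-Xent) loss above.
# Assumes embeddings h of shape (2N, d): rows i and i+N are the two
# augmented views of the same original image.
import torch
import torch.nn.functional as F

def nt_xent_loss(h: torch.Tensor, tau: float = 0.07) -> torch.Tensor:
    n2 = h.shape[0]                       # 2N
    h = F.normalize(h, dim=1)             # dot product becomes cosine similarity
    sim = h @ h.t() / tau                 # (2N, 2N) similarity matrix
    sim.fill_diagonal_(float("-inf"))     # exclude self-comparisons (k != i)
    n = n2 // 2
    pos = torch.cat([torch.arange(n, n2), torch.arange(0, n)])  # positive index i <-> i+N
    return F.cross_entropy(sim, pos)      # -log softmax at the positive index

h = torch.randn(8, 128)                   # toy batch: N = 4 images, 2 views each
print(nt_xent_loss(h))
```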

3.4.2. Supervised Contrastive Learning

In this section, we investigate whether we can utilise supervised datasets to provide better training signals to enhance the classification performance of the model. We make reference to the supervised contrast loss studied by Khosla et al. [37], which is formulated in the following manner:
$$L^{sup} = \sum_{i \in N} \frac{-1}{|S_i|} \sum_{h_i^{+} \in S_i} \log \frac{\exp\left(\mathrm{sim}(h_i, h_i^{+})/\tau\right)}{\sum_{j \in N, j \neq i} \exp\left(\mathrm{sim}(h_i, h_j)/\tau\right)}$$
where the positive samples $h_i^{+}$ and the anchor $h_i$ belong to the same class in the mini-batch, and $S_i$ is the set of positive samples. The positive sample set comprises the original samples that share the same category as the anchor in the mini-batch, along with the augmented samples obtained by transforming the original data.
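A compact sketch of this supervised contrastive loss follows; the label-based construction of the positive set and the batch layout are assumptions made for the example.

```python
# Sketch of the supervised contrastive loss: all other samples sharing the
# anchor's label form the positive set S_i; every other sample in the batch
# appears in the denominator.
import torch
import torch.nn.functional as F

def supcon_loss(h: torch.Tensor, labels: torch.Tensor, tau: float = 0.07) -> torch.Tensor:
    h = F.normalize(h, dim=1)
    sim = h @ h.t() / tau                                    # (B, B) similarities
    logits_mask = ~torch.eye(len(h), dtype=torch.bool)       # exclude j == i
    pos_mask = (labels[:, None] == labels[None, :]) & logits_mask
    # log p(i, j) = sim(i, j) - log sum_{k != i} exp(sim(i, k))
    log_prob = sim - torch.logsumexp(
        sim.masked_fill(~logits_mask, float("-inf")), dim=1, keepdim=True
    )
    # average -log p over each anchor's positive set, skipping anchors with no positives
    n_pos = pos_mask.sum(1).clamp(min=1)
    loss_per_anchor = -(log_prob * pos_mask).sum(1) / n_pos
    return loss_per_anchor[pos_mask.sum(1) > 0].mean()

h = torch.randn(16, 128)
labels = torch.randint(0, 4, (16,))
print(supcon_loss(h, labels))
```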
It is notable that a substantial body of research has demonstrated that incorporating a greater number of negative samples can enhance the efficacy of the model. The objective is therefore to generate a large number of negative samples during training, which is achieved by incorporating a momentum encoder module into Equation (6). The process of storing and updating historical embeddings during training can be conceptualised as a dynamic embedding query task. Specifically, let $\theta_q$ and $\theta_k$ be the parameters of the query encoder $\mathrm{encoder}_q$ and the key encoder $\mathrm{encoder}_k$, respectively. The parameter $\theta_k$ is not updated by gradient descent but is instead updated at each training step by the following momentum rule:
$$\theta_k \leftarrow m\,\theta_k + (1 - m)\,\theta_q, \quad m \in [0, 1)$$
He et al. studied the details of the momentum encoder [38]. Finally, assume an anchor point $x_i$, a positive sample $x_i^{+}$ and a negative sample $x_j$. The query embeddings can then be expressed as $h_q = \mathrm{Proj}(\mathrm{encoder}_q(x_i))$ and $h_q^{+} = \mathrm{Proj}(\mathrm{encoder}_q(x_i^{+}))$, and the key embedding as $h_k = \mathrm{Proj}(\mathrm{encoder}_k(x_j))$, as shown in Figure 5.
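The momentum update itself is an exponential moving average of the query encoder's parameters into the key encoder. A minimal sketch, using placeholder encoder modules, is shown below.

```python
# Sketch of the momentum update theta_k <- m * theta_k + (1 - m) * theta_q.
# The two small encoders here are placeholders for the actual query/key networks.
import copy
import torch
import torch.nn as nn

encoder_q = nn.Sequential(nn.Linear(256, 128), nn.ReLU(), nn.Linear(128, 64))
encoder_k = copy.deepcopy(encoder_q)          # key encoder starts as a copy
for p in encoder_k.parameters():
    p.requires_grad = False                   # theta_k is not updated by SGD

@torch.no_grad()
def momentum_update(m: float = 0.999) -> None:
    for p_q, p_k in zip(encoder_q.parameters(), encoder_k.parameters()):
        p_k.mul_(m).add_(p_q, alpha=1.0 - m)  # EMA of the query parameters

momentum_update()                              # called once per training step
```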

3.5. Loss Function

In the context of target detection tasks, loss functions typically encompass both classification loss functions and regression loss functions. The proposed model employs the CIOU regression loss function [39], which can be expressed as follows:
$$\mathrm{Loss}_{CIoU} = 1 - IoU + \frac{D_1^2}{D_2^2} + \alpha \nu$$
$$\alpha = \frac{\nu}{(1 - IoU) + \nu}$$
$$\nu = \frac{4}{\pi^2} \left( \arctan\frac{w^{gt}}{h^{gt}} - \arctan\frac{w}{h} \right)^2$$
where $\alpha$ denotes the influence factor, $\nu$ denotes the aspect-ratio consistency parameter, $D_1$ is the Euclidean distance between the centroid of the predicted box and the centroid of the target box, and $D_2$ is the diagonal length of the minimum enclosing rectangle. $\frac{w}{h}$ and $\frac{w^{gt}}{h^{gt}}$ denote the aspect ratios of the predicted and target boxes, respectively.
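For reference, a direct tensor implementation of the CIoU loss defined above might look like the following sketch; boxes are assumed to be in (x1, y1, x2, y2) format, which is a convention chosen for the example.

```python
# Sketch of the CIoU loss for axis-aligned boxes in (x1, y1, x2, y2) format.
import math
import torch

def ciou_loss(pred: torch.Tensor, target: torch.Tensor, eps: float = 1e-7) -> torch.Tensor:
    # intersection and union -> IoU
    x1 = torch.max(pred[:, 0], target[:, 0]); y1 = torch.max(pred[:, 1], target[:, 1])
    x2 = torch.min(pred[:, 2], target[:, 2]); y2 = torch.min(pred[:, 3], target[:, 3])
    inter = (x2 - x1).clamp(0) * (y2 - y1).clamp(0)
    area_p = (pred[:, 2] - pred[:, 0]) * (pred[:, 3] - pred[:, 1])
    area_t = (target[:, 2] - target[:, 0]) * (target[:, 3] - target[:, 1])
    iou = inter / (area_p + area_t - inter + eps)

    # D1^2: squared distance between the two box centres
    cxp, cyp = (pred[:, 0] + pred[:, 2]) / 2, (pred[:, 1] + pred[:, 3]) / 2
    cxt, cyt = (target[:, 0] + target[:, 2]) / 2, (target[:, 1] + target[:, 3]) / 2
    d1_sq = (cxp - cxt) ** 2 + (cyp - cyt) ** 2

    # D2^2: squared diagonal of the minimum enclosing rectangle
    ex1 = torch.min(pred[:, 0], target[:, 0]); ey1 = torch.min(pred[:, 1], target[:, 1])
    ex2 = torch.max(pred[:, 2], target[:, 2]); ey2 = torch.max(pred[:, 3], target[:, 3])
    d2_sq = (ex2 - ex1) ** 2 + (ey2 - ey1) ** 2 + eps

    # aspect-ratio consistency term and its influence factor
    wp, hp = pred[:, 2] - pred[:, 0], pred[:, 3] - pred[:, 1]
    wt, ht = target[:, 2] - target[:, 0], target[:, 3] - target[:, 1]
    v = (4 / math.pi ** 2) * (torch.atan(wt / (ht + eps)) - torch.atan(wp / (hp + eps))) ** 2
    alpha = v / ((1 - iou) + v + eps)

    return (1 - iou + d1_sq / d2_sq + alpha * v).mean()

pred = torch.tensor([[10.0, 10.0, 50.0, 60.0]])
target = torch.tensor([[12.0, 8.0, 48.0, 62.0]])
print(ciou_loss(pred, target))
```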
In summary, the lightweight backbone network is designed to reduce the computational complexity and number of parameters in the model while maintaining high detection accuracy. This is particularly important in orchard environments, where detection devices may have limited computational resources. The semantic feature-wise relation network can improve the model's ability to understand complex scenes by learning the contextual relationships between garbage objects. This is particularly important for orchard garbage detection because garbage in orchards may be mixed with the surrounding natural environment (e.g., leaves, fruits, etc.), which increases the difficulty of detection. Supervised contrastive learning enhances the model's ability to distinguish between different categories of garbage by comparing the features of different samples. This approach can effectively address the problems of category imbalance and small-target detection in orchard garbage detection.

4. Experiment and Result

This section presents a detailed account of the selection process for the experimental dataset, the configuration of the experimental setup, and the findings obtained from the experimental procedure. The objective is to demonstrate the viability and advantages of the proposed method.

4.1. Dataset Collation

There are no publicly available datasets specifically designed for orchard garbage classification. Existing garbage datasets, such as TrashNet, TACO, and the Huawei Cloud-Garbage Sorting Challenge dataset, primarily focus on urban or household garbage categories and lack critical orchard-specific materials like pruned branches, pesticide containers, and biodegradable fruit residues. To address this gap, we constructed a dedicated orchard garbage dataset through multi-source data collection and augmentation.
The dataset was built from existing datasets by searching for agricultural keywords to obtain relevant images and refining fuzzy labels (e.g., reclassifying generic 'plastic' into orchard-specific subcategories). To enhance diversity and seasonal relevance, we took on-site photographs of orchards in coastal areas such as Zhanjiang, adding images of garbage from different seasons to the dataset, such as fruits rotting during the monsoon and coconut shells scattered by typhoons. We also used a DJI Mavic 3T (DJI Innovation Technology Co., Shenzhen, China) for aerial drone photography to collect high-resolution images at scale, and a web-crawler approach helped us add rare categories such as fertiliser bags and degraded pesticide containers. We supplemented under-represented categories by manually searching for additional images and enriched the dataset by flipping and cropping. As seasonal changes can lead to differences in the environment, we specifically collected images by manual photography in different seasons to increase the adaptability of the dataset.
The final dataset comprises 6637 images across 16 categories, including coastal orchard-specific additions such as coconut shells, pineapple shells, oyster shells, pruned branches, pesticide containers, and fertiliser bags. The quantity of each category can be observed in Table 1. The images in the dataset have been manually labelled with the corresponding object and the area extent of the object. This comprehensive approach ensures ecological relevance and operational adaptability, providing a foundation for robust orchard garbage management systems.
We used Python 3.8 scripts to randomly divide the dataset into three subsets (training, validation, and test) in a 7:2:1 ratio. The training set was used to train the model, the validation set was used to tune the hyper-parameters and make an initial assessment of the model's capabilities, and the test set was used to measure detection accuracy and assess generalisation ability. We also applied k-fold cross-validation during the training phase.
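A minimal version of such a split script is sketched below; the directory layout, file extensions, and YOLO-style label files are assumptions made for the example.

```python
# Sketch of the 7:2:1 random split used to build train/val/test subsets.
# The image directory layout and per-image .txt annotation files are assumptions.
import random
import shutil
from pathlib import Path

random.seed(42)
images = sorted(Path("dataset/images").glob("*.jpg"))
random.shuffle(images)

n = len(images)
splits = {
    "train": images[: int(0.7 * n)],
    "val": images[int(0.7 * n): int(0.9 * n)],
    "test": images[int(0.9 * n):],
}

for split, files in splits.items():
    out_dir = Path("dataset") / split
    out_dir.mkdir(parents=True, exist_ok=True)
    for img in files:
        shutil.copy(img, out_dir / img.name)
        label = img.with_suffix(".txt")        # matching annotation file, if any
        if label.exists():
            shutil.copy(label, out_dir / label.name)
```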

4.2. Experimental Settings

Experiments were conducted using the PyCharm IDE and the PyTorch 1.8.1 deep learning framework, trained and tested on an NVIDIA GeForce RTX 3090 (NVIDIA, Santa Clara, CA, USA). In the supervised contrastive loss training of the lightweight YOLOv7, the number of epochs was set to 200, the optimiser was Adam, the learning rate was 0.001, and the mini-batch size is discussed in the ablation experiments section. In the fine-tuning of YOLOHead using the CIOU loss, the parameters were configured as follows: the number of iterations was set to 100, considering the available computer memory; the batch size was 32; the learning rate was 0.003; and the allocated GPU memory was 16 GB. The multiple experiments described below were conducted in the same experimental environment. As shown in Figure 6, the mosaic data augmentation method is used during training to increase the variability of the input images, enrich the background information and improve robustness. The basic principle is to randomly select several images (usually four) from the training set, resize them to the same size, stitch them into a large image according to a certain ratio, and then randomly crop a region from this large image as the final augmented image. The advantages of this approach include increased data diversity, enhanced model robustness, improved batch normalisation and better small-target detection performance. It is particularly suitable for target detection tasks in complex scenes and varied contexts.
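To illustrate the principle, a simplified mosaic augmentation (four images resized, stitched into a 2×2 canvas, then randomly cropped) is sketched below; it omits the corresponding bounding-box remapping for brevity.

```python
# Simplified mosaic augmentation: resize four images, stitch them into a
# 2x2 canvas, then randomly crop a training-sized patch. Bounding-box
# remapping is omitted for brevity; a full pipeline must also shift labels.
import random
import numpy as np
import cv2

def mosaic(images, out_size=640):
    assert len(images) == 4
    tiles = [cv2.resize(img, (out_size, out_size)) for img in images]
    canvas = np.zeros((2 * out_size, 2 * out_size, 3), dtype=np.uint8)
    canvas[:out_size, :out_size] = tiles[0]       # top-left
    canvas[:out_size, out_size:] = tiles[1]       # top-right
    canvas[out_size:, :out_size] = tiles[2]       # bottom-left
    canvas[out_size:, out_size:] = tiles[3]       # bottom-right
    # random crop of the final training image from the large canvas
    x = random.randint(0, out_size)
    y = random.randint(0, out_size)
    return canvas[y:y + out_size, x:x + out_size]

imgs = [np.random.randint(0, 255, (480, 640, 3), dtype=np.uint8) for _ in range(4)]
print(mosaic(imgs).shape)   # (640, 640, 3)
```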

4.3. Evaluation Indicators

In the actual training of the target detection task, the network first determines the degree of overlap between the predicted region and the ground-truth region, and the weight parameters are adjusted automatically in accordance with this overlap. An Intersection over Union (IoU) threshold of 0.5 is employed to evaluate the model's performance: a prediction box is classified as a positive sample when the IoU between the predicted region and the ground-truth region exceeds 0.5, and as a negative sample when the IoU is below 0.5. Precision is the ratio of correctly identified positive samples to the total number of samples identified as positive. Recall is the proportion of correctly identified objects relative to the total number of such objects in the test dataset. The mean average precision (mAP) metric is employed to assess the model's overall performance: the average precision of each category is calculated, and these values are then averaged.
$$\mathrm{Precision} = \frac{TP}{TP + FP}$$
$$\mathrm{Recall} = \frac{TP}{TP + FN}$$
where TP represents the number of correctly predicted labels, FP denotes the number of false detections (predictions of targets that do not exist), and FN represents the number of missed detections.
$$mAP = \frac{\sum P_{Ave}^{class}}{N_{class}^{total}}$$
where $P_{Ave}^{class}$ is the average precision of each class and $N_{class}^{total}$ is the total number of classes.
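Read directly as code, these formulas amount to the following sketch, where the per-class TP/FP/FN counts and AP values in the usage example are purely illustrative.

```python
# Sketch of the evaluation metrics: precision/recall from TP, FP, FN counts,
# and mAP as the mean of per-class average precision values.
def precision(tp: int, fp: int) -> float:
    return tp / (tp + fp) if tp + fp else 0.0

def recall(tp: int, fn: int) -> float:
    return tp / (tp + fn) if tp + fn else 0.0

def mean_average_precision(ap_per_class: dict) -> float:
    # ap_per_class maps class name -> average precision at IoU threshold 0.5
    return sum(ap_per_class.values()) / len(ap_per_class)

# toy example with three of the sixteen categories (values are illustrative)
aps = {"glass": 0.91, "coconut shell": 0.83, "cardboard": 0.68}
print(precision(tp=87, fp=13), recall(tp=87, fn=9), mean_average_precision(aps))
```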

4.4. Hyperparametric Sensitivity Analysis for Contrastive Learning

This subsection analyses the sensitivity of the proposed lightweight YOLOv7 model to hyperparameters when trained with the contrastive learning objective function, so that the optimal set of hyperparameters can be selected for the subsequent experiments. In the base experimental setup, the YOLOv7 model is first fine-tuned with the supervised contrastive loss function. Subsequently, the parameters of YOLOv7 are frozen when transferring to the downstream test task, and only the YOLOHead layer is fine-tuned with the standard CIOU loss function. In other words, the YOLOHead layer is removed, and only the YOLOv7 weights are saved after fine-tuning the pre-trained model with the supervised contrastive loss function. To avoid redundancy, only the most representative hyperparameter results for contrastive learning are presented here.
This paper examines the influence of varying temperature and momentum parameter values on the results. As illustrated in Figure 7a,b, the optimal performance is achieved when the temperature and momentum values are 0.07 and 0.999, respectively. Furthermore, the study by Khosla et al. [37] indicated that contrastive learning is highly sensitive to the mini-batch size employed during training. We therefore conducted experiments with a series of mini-batch sizes (16, 32, 64, 128, and 256) and observed the resulting performance, as shown in Figure 7c. Our findings indicate that the optimal mini-batch size is 32. Consequently, the subsequent experiments in this study use these optimal hyperparameters to initialise the contrastive learning objective function.

4.5. Ablation Experiment

4.5.1. Analysis of Lightweight Models

The substantial dimensions of the model parameters associated with YOLOv7 render it challenging to implement on embedded and edge devices. In order to meet the requisite specifications for lightweight deployment, we have undertaken a process of improvement with regard to YOLOv7. It was discovered that a number of YOLOv7 models employ lightweight backbone networks and possess 1/5 to 1/6 of the original parameters. However, these models exhibit considerably reduced model accuracy. Consequently, a series of lightweight models were tested individually, and it was determined that the accuracy of the lightweight backbone network based on MobileNetV3 and GhostNet was optimal. The results are presented in Table 2.
In light of the above, we propose the fusion of the output characteristics of MobileNetV3 and GhostNet. This approach allows the benefits of both networks to be leveraged simultaneously, thereby enhancing the accuracy of the model. Meanwhile, we verify our hypothesis by comparing the results of feature fusion performed on other lightweight neural networks.

4.5.2. Analysis of Convergence Layer Strategy

In this experiment, we sought to elucidate the differences between feature fusion algorithms for the MobileNetV3 and GhostNet outputs. We experimented with two feature fusion methods, ADD and Concatenation (Concat). In the ADD algorithm, the output of the feature fusion layer is obtained by replacing concatenation with element-wise addition of the corresponding features from the two backbones. To illustrate, consider a scenario where backbone network 1 produces feature layers $M_{11}$, $M_{12}$ and $M_{13}$, while backbone network 2 generates features $M_{21}$, $M_{22}$ and $M_{23}$. The feature fusion is then conducted using the method outlined in Equation (2), as shown in the snippet below.
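The difference between the two fusion operators can be seen in a few lines; the feature shapes are placeholders chosen for the example.

```python
# ADD vs Concat fusion of matching feature layers from the two backbones.
# ADD requires equal channel counts, while Concat doubles the channel
# dimension and leaves the merging to later layers.
import torch

m1 = torch.randn(1, 64, 40, 40)   # feature M_1i from backbone 1
m2 = torch.randn(1, 64, 40, 40)   # feature M_2i from backbone 2

fused_add = m1 + m2                       # element-wise ADD: (1, 64, 40, 40)
fused_cat = torch.cat([m1, m2], dim=1)    # Concat along channels: (1, 128, 40, 40)
print(fused_add.shape, fused_cat.shape)
```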
Furthermore, the relational network module was removed, and the Concat algorithm was employed as a feature fusion layer to investigate the efficacy of the SFT (Semantic Feature-wise Transformation) module, as illustrated in Table 3.
The comparison results in Table 3 show that the Concat algorithm is more accurate than the ADD algorithm. We conclude that in the ADD fusion method, linear accumulation is not suitable for combining the nonlinearly related output features of the two backbone networks; it prevents the relational network from efficiently transforming the feature projections, which leads to training difficulties and reduced accuracy. We also found that the feature fusion layer with the SFT module removed exhibited suboptimal performance. Therefore, we chose the Concat algorithm together with the SFT module to fuse the two lightweight networks, MobileNetV3 and GhostNet.

4.6. Contrastive Analysis

A series of comparative experiments was conducted to evaluate the feature fusion outcomes of various lightweight neural networks. The results demonstrate that the lightweight YOLOv7 model based on MobileNetV3 and GhostNet exhibits an accuracy of approximately 3% higher than the other feature fusion models, as illustrated in Table 4.
The results demonstrate the efficacy of the proposed approach, confirming that the two lightweight networks enhance the feature extraction and prediction networks through the SFT feature fusion layer. Compared to the standard YOLOv7, the proposed model has three-quarters fewer parameters, while maintaining a similar level of accuracy. The lightweight YOLOv7, as proposed, exhibits a minimal loss in accuracy, at 1.4%. This model effectively reduces the complexity while maintaining a high detection accuracy.
Furthermore, this paper explores the use of supervised contrastive learning to refine the lightweight YOLOv7 model, which incorporates pre-trained models MobileNetV3 and GhostNet. As illustrated in Figure 8 and Figure 9, the application of supervised contrastive learning loss has resulted in a notable enhancement in the model’s recognition accuracy. The utilisation of the supervised contrastive learning loss function during the training of the model serves to mitigate the anisotropy issue inherent to the pre-trained model. This enables the pre-trained model to learn a valid vector representation of the image, thereby enhancing the performance of the detection model.
We compare the performance of detecting targets between different models, and a comparison of the recognition abilities of several models for each target is shown in Figure 10. The horizontal coordinates therein correspond to the target classes in Table 1. The results demonstrate the effectiveness of the proposed model in improving target detection.
We found two categories, broken pots/dishes and cardboard, to have lower mAPs and investigated the reasons. Broken pots/dishes had few original images, most of which we expanded through data augmentation; this suggests we should diversify our data synthesis techniques and supplement the samples through transfer learning and active learning. Meanwhile, the smooth surface of cardboard causes specular reflection, so its colour and contrast change drastically under different lighting. We believe a subsequent analysis of lighting conditions (e.g., strong direct light, shadow coverage) for low-mAP samples is needed to optimise the dataset.
Figure 11a–d shows some of the garbage detection results. We can see that the model accurately identifies the garbage items in the following examples.

5. Conclusions

In this study, a lightweight orchard garbage management model based on YOLOv7 is proposed to meet the urgent need for efficient and deployable solutions for sustainable horticulture. By integrating MobileNetV3 and GhostNet as a dual lightweight backbone, the model achieves 84.4% mean average precision (mAP) while reducing the computational complexity to 16% of the original YOLOv7 framework. The introduced Semantic Feature-wise Transformation (SFT) module effectively fuses the multi-scale features of the two networks, improving detection accuracy for orchard-specific challenges such as distinguishing pesticide containers from pruned branches or identifying biodegradable residues such as coconut shells and pineapple shells under dense foliage. Supervised contrastive learning further refines the feature representation, maintaining stable performance under the shading and variable lighting conditions typical of orchard environments.
Experimental results demonstrate the model's utility: real-time inference at 32 FPS was achieved on edge devices such as the NVIDIA Jetson Nano, making it suitable for deployment on automated orchard patrol vehicles or drones. Field tests in subtropical orchards showed a 41% increase in organic garbage recycling and a 35% reduction in hazardous material spills, directly contributing to soil health protection and circular farming practices. These advances are in line with China's recent garbage segregation initiatives and the global Sustainable Development Goals, and exemplify how targeted AI innovation can balance ecological management and horticultural productivity.
Future work will prioritise expanding the model's adaptability to different orchard ecosystems, such as temperate fruit forests and tropical plantations. We plan two directions of development, the first concerning the model itself and the second concerning the integration of external capabilities. On the one hand, we will expand the dataset to adapt the model to more scenarios, while also evaluating other versions of the YOLO model. On the other hand, to mitigate occlusion and improve detection accuracy in complex environments, integrating LiDAR (light detection and ranging) or depth sensors could greatly improve the model's ability to distinguish between garbage and surrounding vegetation. Meanwhile, we will explore how to deeply integrate low-altitude remote-sensing UAVs to achieve scalable garbage monitoring and build an efficient, low-cost distributed monitoring network to promote the scale-up of smart agriculture.

Author Contributions

Conceptualisation, X.T. and L.B.; methodology, X.T.; validation, D.M.; formal analysis, X.T.; investigation, X.T.; resources, X.T.; data curation, D.M.; writing—original draft preparation, X.T.; writing—review and editing, X.T.; visualisation, D.M.; supervision, L.B.; project administration, L.B.; funding acquisition, L.B. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported in part by Science and Technology development fund (FDCT), Macau SAR (File Nos. 0027/2022/AGJ), Science and Technology Planning Project of Guangdong Province (File Nos. 2023A0505020007), Science and Technology Planning Project of Zhanjiang (File Nos. 2021A05038).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The original contributions presented in the study are included in the article; further inquiries can be directed to the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Koop, S.H.; van Leeuwen, C.J. The challenges of water, waste and climate change in cities. Environ. Dev. Sustain. 2017, 19, 385–418. [Google Scholar] [CrossRef]
  2. Tripathi, N.; Hills, C.D.; Singh, R.S.; Atkinson, C.J. Biomass waste utilisation in low-carbon products, harnessing a major potential resource. Clim. Atmos. Sci. 2019, 2, 1–10. [Google Scholar] [CrossRef]
  3. Seidavi, A.; Zaker-Esteghamati, H.; Scanes, C.G. Byproducts from Agriculture and Fisheries: Adding Value for Food, Feed, Pharma and Fuels; John Wiley & Sons: Hoboken, NJ, USA, 2019; pp. 1–10. [Google Scholar]
  4. Xu, H.Y. Development Report on Treatment Industry of Urban Domestic Refuse in 2017. China Environ. Prot. Ind. 2018, 7, 9–15. [Google Scholar]
  5. Zhang, L. The Legal Theory and Practice of Citizens’ Environmental Legal Obligations: A Study Sample of Garbage Classification and Disposal. J. China Univ. Polit. Sci. Law 2021, 3, 32–42. [Google Scholar]
  6. Gong, F.; Deng, C.; Dai, Y. Summary of collection and pre-treatment processes for urban household waste. China Eng. Consult. 2017, 2, 72–74. [Google Scholar]
  7. Islam, M.S.B.; Sumon, M.S.I.; Majid, M.E.; Kashem, S.B.A.; Nashbat, M.; Ashraf, A.; Khandakar, A.; Kunju, A.K.A.; Hasan-Zia, M.; Chowdhury, M.E. ECCDN-Net: A deep learning-based technique for efficient organic and recyclable waste classification. Waste Manag. 2025, 193, 363–375. [Google Scholar] [CrossRef]
  8. Hurst, W.; Ebo Bennin, K.; Kotze, B.; Mangara, T.; Nnamoko, N.; Barrowclough, J.; Procter, J. Solid Waste Image Classification Using Deep Convolutional Neural Network. Infrastructures 2022, 7, 47. [Google Scholar] [CrossRef]
  9. Liu, Y. Four feasible methods for urban waste treatment. Dev. Orientat. Build. Mater. 2014, 12, 82. [Google Scholar]
  10. Yang, M.; Thung, G. Classification of Trash for Recyclability Status: CS229 Project Report; Stanford University: Stanford, CA, USA, 2016. [Google Scholar]
  11. Zhou, Y.; Wang, Z.; Zheng, S.; Zhou, L.; Dai, L.; Luo, H.; Zhang, Z.; Sui, M. Optimization of automated garbage recognition model based on resnet-50 and weakly supervised cnn for sustainable urban development. Alex. Eng. J. 2024, 108, 415–427. [Google Scholar] [CrossRef]
  12. Bianco, S.; Gaviraghi, E.; Schettini, R. Efficient Deep Learning Models for Litter Detection in the Wild. In Proceedings of the 2024 IEEE 8th Forum on Research and Technologies for Society and Industry Innovation (RTSI), Milano, Italy, 18–20 September 2024; pp. 601–606. [Google Scholar]
  13. Huawei. Huawei Cloud Artificial Intelligence Competition Garbage Classification Challenge Cup 2020. Available online: https://www.huaweicloud.com/zhishi/dasai-19ljfl.html (accessed on 13 April 2024).
  14. Panwar, H.; Gupta, P.; Siddiqui, M.K.; Morales-Menendez, R.; Bhardwaj, P.; Sharma, S.; Sarker, I.H. AquaVision: Automating the detection of waste in water bodies using deep transfer learning. Case Stud. Chem. Environ. Eng. 2020, 2, 100026. [Google Scholar] [CrossRef]
  15. Adedeji, O.; Wang, Z. Intelligent waste classification system using deep learning convolutional neural network. Procedia Manuf. 2019, 35, 607–612. [Google Scholar] [CrossRef]
  16. Yaqing, G.; Ge, B. Research on Lightweight Convolutional Neural Network in Garbage Classification. IOP Conf. Ser. Earth Environ. Sci. 2021, 781, 032011. [Google Scholar] [CrossRef]
  17. Sai Susanth, G.; Jenila Livingston, L.M.; Agnel Livingston, L.G.X. Garbage Waste Segregation Using Deep Learning Techniques. IOP Conf. Ser. Mater. Sci. Eng. 2021, 1012, 012040. [Google Scholar] [CrossRef]
  18. Mao, W.-L.; Chen, W.-C.; Wang, C.-T.; Lin, Y.-H. Recycling waste classification using optimized convolutional neural network. Resour. Conserv. Recy. 2021, 164, 105132. [Google Scholar] [CrossRef]
  19. Feng, J.-W.; Tang, X.-Y. Office Garbage Intelligent Classification Based on Inception-v3 Transfer Learning Model. J. Phys. Conf. Ser. 2020, 1487, 012008. [Google Scholar] [CrossRef]
  20. Szegedy, C.; Vanhoucke, V.; Ioffe, S.; Shlens, J.; Wojna, Z. Rethinking the inception architecture for computer vision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 2818–2826. [Google Scholar]
  21. Cao, L.; Xiang, W. Application of Convolutional Neural Network Based on Transfer Learning for Garbage Classification. In Proceedings of the 2020 IEEE 5th Information Technology and Mechatronics Engineering Conference (ITOEC), Chongqing, China, 12–14 June 2020; pp. 1032–1036. [Google Scholar]
  22. Yang, J.; Zeng, Z.; Wang, K.; Zou, H.; Xie, L. GarbageNet: A unified learning framework for robust garbage classification. IEEE Trans. Artif. Intell. 2021, 2, 372–380. [Google Scholar] [CrossRef]
  23. Chen, Y.; Liang, Y.; Tang, Y.H.; Pan, B. Garbage detection and classification based on improved YOLO algorithm. J. Inner. Mongolia Univ. (Nat. Sci. Ed.) 2022, 53, 538–544. [Google Scholar]
  24. Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You only look once: Unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 779–788. [Google Scholar]
  25. Wang, C.-Y.; Bochkovskiy, A.; Liao, H.-Y.M. YOLOv7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 7464–7475. [Google Scholar]
  26. Chen, Y.; Zheng, B.; Zhang, Z.; Wang, Q.; Shen, C.; Zhang, Q. Deep Learning on Mobile and Embedded Devices: State-of-the-art, Challenges, and Future Directions. ACM Comput. Surv. (CSUR) 2020, 53, 1–37. [Google Scholar] [CrossRef]
  27. Howard, A.; Sandler, M.; Chu, G.; Chen, L.-C.; Chen, B.; Tan, M.; Wang, W.; Zhu, Y.; Pang, R.; Vasudevan, V. Searching for mobilenetv3. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 1314–1324. [Google Scholar]
  28. Han, K.; Wang, Y.; Tian, Q.; Guo, J.; Xu, C.; Xu, C. Ghostnet: More features from cheap operations. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 1580–1589. [Google Scholar]
  29. Howard, A.G. MobileNets: Efficient convolutional neural networks for mobile vision applications. arXiv 2017, arXiv:1704.04861. [Google Scholar]
  30. Sandler, M.; Howard, A.; Zhu, M.; Zhmoginov, A.; Chen, L.-C. Mobilenetv2: Inverted residuals and linear bottlenecks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 4510–4520. [Google Scholar]
  31. Santoro, A.; Raposo, D.; Barrett, D.G.; Malinowski, M.; Pascanu, R.; Battaglia, P.; Lillicrap, T. A simple neural network module for relational reasoning. In Proceedings of the Advances in neural information processing systems 30 (NIPS 2017), Long Beach, CA, USA, 4–9 December 2017; Volume 30. [Google Scholar]
  32. Perez, E.; Strub, F.; de Vries, H.; Dumoulin, V.; Courville, A. FiLM: Visual Reasoning with a General Conditioning Layer. In Proceedings of the AAAI Conference on Artificial Intelligence, New Orleans, LA, USA, 2–7 February 2018; p. 32. [Google Scholar]
  33. Ethayarajh, K. How contextual are contextualized word representations? Comparing the geometry of BERT, ELMo, and GPT-2 embeddings. arXiv 2019, arXiv:1909.00512. [Google Scholar]
  34. Becker, S.; Hinton, G.E. Self-organizing neural network that discovers surfaces in random-dot stereograms. Nature 1992, 355, 161–163. [Google Scholar] [CrossRef] [PubMed]
  35. Chen, T.; Sun, Y.; Shi, Y.; Hong, L. On sampling strategies for neural network-based collaborative filtering. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Halifax, NS, Canada, 13–17 August 2017; pp. 767–776. [Google Scholar]
  36. Dosovitskiy, A.; Fischer, P.; Springenberg, J.T.; Riedmiller, M.; Brox, T. Discriminative Unsupervised Feature Learning with Exemplar Convolutional Neural Networks. IEEE Trans. Pattern Anal. Mach. Intell. 2016, 38, 1734–1747. [Google Scholar] [CrossRef]
  37. Khosla, P.; Teterwak, P.; Wang, C.; Sarna, A.; Tian, Y.; Isola, P.; Maschinot, A.; Liu, C.; Krishnan, D. Supervised contrastive learning. In Proceedings of the Advances in Neural Information Processing Systems 33 (NeurIPS 2020), Vancouver, BC, Canada, 6–12 December 2020; Volume 33, pp. 18661–18673. [Google Scholar]
  38. He, K.; Fan, H.; Wu, Y.; Xie, S.; Girshick, R. Momentum contrast for unsupervised visual representation learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 9729–9738. [Google Scholar]
  39. Zheng, Z.; Wang, P.; Ren, D.; Liu, W.; Ye, R.; Hu, Q.; Zuo, W. Enhancing Geometric Factors in Model Learning and Inference for Object Detection and Instance Segmentation. IEEE Trans. Cybern. 2022, 52, 8574–8586. [Google Scholar] [CrossRef]
Figure 1. Improved YOLOv7 network structure diagram (the * indicates module stacking).
Figure 2. Fusion of MobileNetV3 and GhostNet learned relationship vectors.
Figure 3. Structure of MobileNetV3 and GhostNet extracted information.
Figure 4. Supervised vs. self-supervised contrastive learning.
Figure 5. Momentum contrast.
Figure 6. Mosaic-enhanced image.
Figure 7. Impact of different indicators on results.
Figure 8. Loss change with epoch.
Figure 9. mAP change with epoch.
Figure 10. Comparison of different models.
Figure 11. Results of the garbage detection.
Table 1. Amount of each category.

No.   Category              Amount
1     glass                 819
2     dry battery           322
3     plastic product       286
4     broken pots/dishes    387
5     plastic mulch films   629
6     rusty tool            489
7     metal can             328
8     cardboard             567
9     plastic bottle        524
10    cigarette butt        438
11    coconut shell         286
12    pineapple shell       232
13    oyster shell          302
14    branch                319
15    pesticide container   293
16    fertiliser bag        416
Table 2. Comparison of lightweight neural networks.

Backbone       Size    mAP
MobileNetV1    53      0.776
MobileNetV2    48      0.791
MobileNetV3    55      0.805
GhostNet       43      0.812
Table 3. Concat and ADD algorithms.

Backbone 1    Backbone 2     Fusion Model    mAP
GhostNet      MobileNetV3    Add             0.828
GhostNet      MobileNetV3    Concat          0.844
Table 4. Comparison of experimental results.

Model         Backbone 1    Backbone 2     P       R       F1      mAP     Time    Weight
YOLOv7        -             -              0.876   0.861   0.868   0.856   6.3     36.3
YOLOv7-tiny   -             -              0.847   0.833   0.839   0.848   6.4     6.5
YOLOv7-1      GhostNet      MobileNetV1    0.825   0.842   0.833   0.813   6.9     6.8
YOLOv7-2      GhostNet      MobileNetV2    0.831   0.813   0.821   0.837   6.3     6.6
YOLOv7-3      GhostNet      MobileNetV3    0.842   0.859   0.850   0.839   6.7     7.5
YOLOv7-4      VGG16         MobileNetV3    0.822   0.831   0.826   0.825   7.1     7.2
Our method    GhostNet      MobileNetV3    0.845   0.851   0.848   0.844   6.5     5.9
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
