Article

YOLOv8n-WSE-Pest: A Lightweight Deep Learning Model Based on YOLOv8n for Pest Identification in Tea Gardens

1 College of Tea Science, Yunnan Agricultural University, Kunming 650201, China
2 Yunnan Organic Tea Industry Intelligent Engineering Research Center, Yunnan Agricultural University, Kunming 650201, China
3 College of Mechanical and Electrical Engineering, Wuhan Donghu University, Wuhan 430071, China
* Author to whom correspondence should be addressed.
Appl. Sci. 2024, 14(19), 8748; https://doi.org/10.3390/app14198748
Submission received: 20 August 2024 / Revised: 23 September 2024 / Accepted: 25 September 2024 / Published: 27 September 2024
(This article belongs to the Section Agricultural Science and Technology)

Abstract

China’s Yunnan Province, known for its tea plantations, faces significant challenges in smart pest management due to its ecologically intricate environment. To enable the intelligent monitoring of pests within tea plantations, this study introduces a novel image recognition algorithm, designated as YOLOv8n-WSE-pest. Taking into account the pest image data collected from organic tea gardens in Yunnan, this study utilizes the YOLOv8n network as a foundation and optimizes the original loss function using WIoU-v3 to achieve dynamic gradient allocation and improve the prediction accuracy. The addition of the Spatial and Channel Reconstruction Convolution structure in the Backbone layer reduces redundant spatial and channel features, thereby reducing the model’s complexity. The integration of the Efficient Multi-Scale Attention Module with Cross-Spatial Learning enables the model to have more flexible global attention. The research results demonstrate that compared to the original YOLOv8n model, the improved YOLOv8n-WSE-pest model shows increases in the precision, recall, mAP50, and F1 score by 3.12%, 5.65%, 2.18%, and 4.43%, respectively. In external validation, the mAP of the model outperforms other deep learning networks such as Faster-RCNN, SSD, and the original YOLOv8n, with improvements of 14.34%, 8.85%, and 2.18%, respectively. In summary, the intelligent tea garden pest identification model proposed in this study excels at the precise detection of key pests in tea plantations, enhancing the efficiency and accuracy of pest management through the application of advanced techniques in applied science.

1. Introduction

Yunnan Province, located across the Tropic of Cancer, is the birthplace of Pu’er tea and boasts a vast area of favorable climate [1,2]. With abundant rainfall and excellent ecological conditions, it has become an ideal place for tea cultivation [3,4]. However, the favorable ecological environment has also fostered a diverse range of pests, leading to aggravated pest infestation in tea gardens [5,6]. The status of pests and diseases in organic tea gardens directly affects the yield and quality of tea leaves [7,8,9]. With the development of applied science in recent years, the adoption of advanced agricultural science and technology, including high-efficiency, targeted smart pest management techniques, has become an indispensable component in addressing insect pest challenges within Yunnan’s tea plantations [10,11].
The challenges faced in the realm of intelligent and precise pest control encompass the accurate identification of pests and the difficulties in deployment due to the high complexity of models [12,13,14]. Traditional recognition algorithms typically necessitate manual feature extraction and the use of basic classifiers, which, when tasked with identifying complex pest and disease samples in the variable agricultural environment, limit the accuracy and generalizability of the model [15,16]. Moreover, the complexity of these models imposes constraints on their practical application due to limitations in computational power and hardware conditions [17,18]. The limitations in feature engineering and insufficient reliance on extensive datasets further impede the ability to rapidly and accurately identify pests and diseases [19,20].
In recent years, with the development of technologies such as deep learning, the Internet of Things, and big data, the field of intelligent pest control in agriculture has progressed rapidly [21,22,23]. Currently, identification algorithms for tea garden pests have largely transitioned to employing deep learning models, with YOLO (You Only Look Once) emerging as a predominant methodology. This deep learning-based algorithm, rooted in advanced convolutional neural network architectures, excels in analyzing pest image characteristics and accurately differentiating between various pest species [24,25,26,27]. Yunong Tian et al. proposed MD-YOLO, Multi-scale Dense YOLO, for small target pest detection, achieving an mAP value of 86.2%; this model enhances the accurate identification of pests across various scales, thereby improving both the efficiency and accuracy of detection tasks [28]. Shifeng Dong et al. developed ESA-Net, an efficient scale-aware network for small-crop pest detection, which achieved an mAP50 value of 68.8% on the LMPD2020 dataset and 75.3% on the APHIDc dataset, underscoring the efficacy of this YOLO variant in addressing variations in pest sizes and reinforcing detection capabilities [29]. Zhe Tang et al. introduced Pest-YOLO: Deep Image Mining and Multi-Feature Fusion for Real-Time Agriculture Pest Detection. A recall rate of 83.5% highlights the notable contribution of this YOLO derivative in broadening the scope of detection and mitigating missed detections [30]. Zhichao Chen et al. proposed TeaViTNet: Tea Disease and Pest Detection Model Based on Fused multi-scale Attention. Achieving an average accuracy of 89.1%, this innovative application of the YOLO framework not only augments detection precision but also bolsters the model’s ability to capture pest features within complex scenarios through its multi-scale attention mechanism [31]. While acknowledging the advancements made, the existing pest identification models grapple with limitations pertaining to regional adaptability, the inadequate capture of features, and elevated costs. Specifically, these models encounter difficulties when deployed across different regions and struggle to precisely extract vital information due to ecological and environmental disparities, thereby undermining their recognition accuracy. Furthermore, the pursuit of enhanced precision has led to a rise in model complexity, which, in turn, curtails the economic viability and practical feasibility of their real-world implementation [32].
In pursuit of enhanced accuracy in pest identification, this study aims to introduce an innovative image recognition algorithm, YOLOv8n-WSE-pest, designed for deployment in geographical environments such as tea plantations, with a focus on performance efficiency and lightweight model architecture. The algorithm is grounded in a dataset of pest images collected from organic tea gardens in Yunnan Province, leveraging the YOLOv8n network as its foundational framework. By optimizing the original loss function with WIoU-v3, the algorithm achieves dynamic gradient allocation and enhanced predictive accuracy. Furthermore, our research incorporates Spatial and Channel Reconstruction Convolution in the Backbone layer to reduce spatial and channel redundancy, thereby simplifying the model’s structure. Concurrently, an Efficient Multi-Scale Attention Module with Cross-Spatial Learning is integrated into the Neck layer, enabling the model to allocate global attention more flexibly [33].

2. Materials and Methods

2.1. Data Collection

This study collected pest images from three locations in Yunnan Province, China: the He Kai Base of Yuecheng Technology Co., Ltd. (Menghai County, 21.78° N, 100.46° E, China); the tea gardens in Sanjia Village (Nanhua County, 24.78° N, 100.76° E, China); and the teaching and research base of the Tea Science College at Yunnan Agricultural University (Kunming, 25.13° N, 102.75° E, China).
During the image collection phase, a specific approach was employed to capture images of fast-moving insects. Sticky traps were placed on tea trees to attract a large number of insects for photography under supplementary lighting. On the other hand, slow-moving insects were directly photographed above the leaves. A mobile device with the macro lens APL-MS002BK (APEXEL, Shenzhen, China) was used to capture magnified images of the pests at a 200× zoom level. The lens had a focal length of 40.6 mm, an exit pupil diameter of 1.45 mm, a focusing distance of 0.3 mm, and a brightness range of 85 to 150 lm. The color temperature was set between 6000 and 6500 K. After capturing the photos, they were transmitted to the terminal for subsequent image processing. The data collection process is illustrated in Figure 1.
For the four types of pests, Toxoptera aurantii, Xyleborus fornicatus Eichhoff, Empoasca pirisuga Matsumura, and Malthodes discicollis Baudi di Selve, this study collected 133, 141, 128, and 143 original images, respectively, totaling 545 tea garden pest images, which served as training samples for the model recognition. Toxoptera aurantii is among the most detrimental pests to tea plants, severely impacting their health by sucking phloem sap and excreting honeydew that increases the risk of sooty mold fungi [34]. Xyleborus fornicatus Eichhoff, primarily parasitic on tea and rubber trees, inflicts substantial branch damage, with infestation rates that can reach 50–65%, sometimes leading to complete plant desiccation [35]. Empoasca pirisuga Matsumura affects tea plant growth by piercing and sucking the sap from buds and leaves, resulting in a decline in tea quality. Its small size and rapid reproduction rate render control measures challenging [36]. Malthodes discicollis Baudi di Selve exhibits over-predation behavior in tea gardens, which includes consuming beneficial insects like spiders [37,38,39]. Figure 2 showcases some of the collected original image samples.

2.2. Data Augmentation

To prevent overfitting during model training, this study ensured an adequate number of images. Image augmentation techniques were employed to expand the dataset. Contrast enhancement and contrast reduction were applied to enhance and reduce contrast by 40%, respectively. Brightness enhancement and brightness reduction were used to increase and decrease brightness by 40%, respectively. Additionally, random rotation was applied to randomly change the orientation of the images. Through these methods, the original data were expanded by a factor of 10, resulting in a total of 5450 images. Figure 3 illustrates several image augmentation techniques employed specifically for Malthodes discicollis Baudi di Selve.
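As a concrete illustration of this pipeline, the sketch below applies the five listed operations with PIL; the folder names are placeholders, and since the exact combination of operations used to reach the 10-fold expansion is not specified, the sketch shows only a single pass over each original image.

```python
# Hypothetical augmentation pass mirroring the operations described above.
# Folder names are placeholders; contrast/brightness are shifted by +/-40%.
import random
from pathlib import Path
from PIL import Image, ImageEnhance

def augment(img: Image.Image) -> list[Image.Image]:
    """Return the augmented variants generated from one original image."""
    return [
        ImageEnhance.Contrast(img).enhance(1.4),    # contrast enhancement (+40%)
        ImageEnhance.Contrast(img).enhance(0.6),    # contrast reduction (-40%)
        ImageEnhance.Brightness(img).enhance(1.4),  # brightness enhancement (+40%)
        ImageEnhance.Brightness(img).enhance(0.6),  # brightness reduction (-40%)
        img.rotate(random.uniform(0, 360), expand=True),  # random rotation
    ]

src, dst = Path("raw_images"), Path("augmented_images")  # placeholder folders
dst.mkdir(exist_ok=True)
for path in src.glob("*.jpg"):
    original = Image.open(path).convert("RGB")
    original.save(dst / path.name)                 # keep the original image
    for i, variant in enumerate(augment(original)):
        variant.save(dst / f"{path.stem}_aug{i}.jpg")
```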
In the pursuit of enhancing the dataset’s utility for training purposes, a stringent quality control measure was implemented, leading to the removal of a total of 127 low-quality images. Figure 4 highlights several prominent examples of such subpar images that were deemed unusable for training the model. This resulted in a final dataset of 5323 high-quality images. The dataset consisted of 1332 images of Toxoptera aurantii, 1335 images of Xyleborus fornicatus Eichhoff, 1312 images of Empoasca pirisuga Matsumura, and 1344 images of Malthodes discicollis Baudi di Selve. These images formed the dataset used in this study.
The images were annotated using the LabelImg tool. The dataset for the four pest categories was divided in a ratio of 6:2:2, representing the training set, testing set, and validation set, respectively. The distribution of the pest dataset and the label distribution are presented in Table 1.
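For clarity, a minimal sketch of such a 6:2:2 split is given below; the directory layout and random seed are assumptions, and the annotation files produced by LabelImg would be kept alongside their images in the same way.

```python
# Minimal 6:2:2 split per pest class; folder layout and seed are illustrative.
import random
from pathlib import Path

def split_dataset(image_dir: str, seed: int = 0) -> dict[str, list[Path]]:
    images = sorted(Path(image_dir).glob("*.jpg"))
    random.Random(seed).shuffle(images)
    n_train, n_test = int(0.6 * len(images)), int(0.2 * len(images))
    return {
        "train": images[:n_train],
        "test": images[n_train:n_train + n_test],
        "val": images[n_train + n_test:],
    }

splits = split_dataset("dataset/Toxoptera_aurantii")  # repeat for each pest class
print({name: len(files) for name, files in splits.items()})
```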

2.3. YOLOv8 Network Improvement

YOLOv8 (You Only Look Once version 8) is an object detection algorithm from the YOLO series developed by Ultralytics. Among the YOLOv8 variants, YOLOv8n (YOLOv8 Nano) is the lightest version [40]. The model consists of an input layer and a network structure composed of Backbone, Neck, and Head layers [41]. The input layer defines the input size and data type of the model. The Backbone module serves as a feature extractor and is responsible for capturing contextual and abstract information from the input image by extracting high-level semantic features. The Neck module further integrates the features extracted by the Backbone module to fuse different levels of features. The Head module includes bounding box regression and classification layers, which are used for predicting the position and category, thereby generating the final object detection results.
To enhance the model’s ability to extract key features, adapt to multi-scale variations, reduce the parameter count to lower computational costs, and ensure good generalization with low complexity, this study builds upon the YOLOv8n model. It optimizes the original loss function using WIoU-v3 to achieve dynamic gradient allocation. Additionally, it improves the network structure of the Backbone layer by employing Spatial and Channel Reconstruction Convolution to reduce feature redundancy. Furthermore, an Efficient Multi-Scale Attention Module with Cross-Spatial Learning is incorporated into the Neck layer. The improved network structure is illustrated in Figure 5.

2.3.1. Improvement of Loss Function

IoU (Intersection over Union) is a simple function used to compute the localization loss by evaluating the degree of overlap between two bounding boxes [42,43]. The YOLOv8n network employs the CIoU (Complete IoU) as its loss function, which primarily focuses on the IoU between predicted and ground truth boxes. However, it fails to fully capture the relative position and size differences between objects and is not suitable for insect images with missing body parts. Moreover, when dealing with low-quality data sources, the original loss function cannot accurately measure the performance of object detection.
To address these issues, this study improves the YOLOv8n’s default loss function using the dynamic non-monotonic focal mechanism, WIoU-v3 [44]. This enhanced loss function aims to overcome the limitations of the original loss function by considering the relative position and size differences between objects. The schematic diagram of this loss function is depicted in Figure 6.
The WIoU-v3 formulation is defined by Equations (1)–(3), where its core idea is to replace IoU with an outlier degree to assess the quality of anchor boxes. The degree of outliers of an anchor box is represented by the ratio of $\mathcal{L}_{IoU}^{*}$ to $\overline{\mathcal{L}_{IoU}}$. Here, $\mathcal{L}_{IoU}^{*}$ denotes the monotonically increasing focus coefficient of the original $\mathcal{L}_{WIoUv1}$, while $\overline{\mathcal{L}_{IoU}}$ represents its sliding average with momentum $m$. The non-monotonic focus coefficient $r$ is constructed from the degree of outliers, denoted by $\beta$. Additionally, $\alpha$ and $\delta$ are hyperparameters involved in the formulation.

$\beta = \frac{\mathcal{L}_{IoU}^{*}}{\overline{\mathcal{L}_{IoU}}} \in [0, +\infty)$ (1)
$m = 1 - \sqrt[tn]{0.05}$ (2)
$\mathcal{L}_{WIoUv3} = r\,\mathcal{L}_{WIoUv1}, \quad r = \frac{\beta}{\delta\,\alpha^{\beta - \delta}}$ (3)
Applying the coefficient $r$ to $\mathcal{L}_{WIoUv1}$, which is defined by Equations (4) and (5), $W_{g}$ and $H_{g}$ represent the width and height of the smallest enclosing box covering the predicted and ground truth boxes. $\mathcal{R}_{WIoU} \in [1, e)$ amplifies the $\mathcal{L}_{IoU}$ of regular-quality anchor boxes, while $\mathcal{L}_{IoU} \in [0, 1]$ reduces the $\mathcal{R}_{WIoU}$ of high-quality anchor boxes. Due to the dynamic nature of $\overline{\mathcal{L}_{IoU}}$, $\mathcal{L}_{WIoUv3}$ is capable of a real-time gradient allocation strategy. When the value of $\beta$ is small, the anchor box has a high quality, and the loss function assigns it a smaller gradient gain, focusing the bounding box regression on anchor boxes of regular quality. When the value of $\beta$ is large, the anchor box has a low quality, and the loss function again assigns it a smaller gradient gain to avoid harmful gradients caused by low-quality anchor boxes while still maintaining attention on anchor boxes of regular quality. This loss method allows for a greater emphasis on anchor samples of regular quality, thereby improving the overall performance of the network model.
$\mathcal{L}_{WIoUv1} = \mathcal{R}_{WIoU}\,\mathcal{L}_{IoU}$ (4)
$\mathcal{R}_{WIoU} = \exp\!\left(\frac{(x - x_{gt})^{2} + (y - y_{gt})^{2}}{\left(W_{g}^{2} + H_{g}^{2}\right)^{*}}\right)$ (5)
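To make the gradient allocation concrete, the following is a hedged PyTorch sketch of the WIoU-v3 weighting described by Equations (1)–(5); the class name, the α/δ defaults, and the moving-average update are illustrative assumptions rather than the exact training configuration.

```python
# Sketch of WIoU-v3 weighting (Eqs. (1)-(5)); hyperparameters are assumptions.
import torch

class WIoUv3Loss:
    def __init__(self, alpha: float = 1.9, delta: float = 3.0, momentum: float = 0.05):
        self.alpha, self.delta, self.momentum = alpha, delta, momentum
        self.iou_loss_mean = 1.0  # running mean playing the role of \bar{L}_IoU

    def __call__(self, iou, pred_xy, gt_xy, enclose_wh):
        l_iou = 1.0 - iou                                    # L_IoU in [0, 1]
        # R_WIoU in [1, e): distance penalty w.r.t. the smallest enclosing box (Eq. 5);
        # detach() reflects the superscript *, i.e. no gradient through the denominator
        dist2 = ((pred_xy - gt_xy) ** 2).sum(dim=-1)
        r_wiou = torch.exp(dist2 / (enclose_wh ** 2).sum(dim=-1).detach())
        l_wiou_v1 = r_wiou * l_iou                           # Eq. (4)
        beta = l_iou.detach() / self.iou_loss_mean           # outlier degree (Eq. 1)
        r = beta / (self.delta * self.alpha ** (beta - self.delta))  # Eq. (3)
        # exponential moving average standing in for the momentum update of Eq. (2)
        self.iou_loss_mean = ((1 - self.momentum) * self.iou_loss_mean
                              + self.momentum * l_iou.mean().item())
        return (r * l_wiou_v1).mean()

loss_fn = WIoUv3Loss()
iou = torch.tensor([0.85, 0.40, 0.92])                       # toy IoU values
pred_xy = torch.tensor([[10.0, 12.0], [4.0, 5.0], [30.0, 31.0]])   # predicted box centers
gt_xy = torch.tensor([[11.0, 12.5], [8.0, 9.0], [30.2, 31.1]])     # ground truth centers
enclose_wh = torch.tensor([[20.0, 18.0], [15.0, 14.0], [22.0, 21.0]])  # enclosing box sizes
print(loss_fn(iou, pred_xy, gt_xy, enclose_wh))
```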

2.3.2. Spatial and Channel Reconstruction Convolution for Feature Redundancy

To improve the operational efficiency of the YOLOv8n network, this study introduces SCConv (Spatial and Channel Reconstruction Convolution for feature redundancy). The module is inserted before the SPPF module within the existing Backbone architecture, mitigating the propagation of redundant information during feature extraction while reducing computational cost. Because this is an augmentation of the original framework rather than a substitution, the model retains its lightweight profile [45]. The core principle of the module is illustrated in Figure 7, where SCConv consists of two modules, namely the SRU (Spatial Reconstruction Unit) and the CRU (Channel Reconstruction Unit), connected in sequential order. These modules aim to reduce the spatial and channel feature redundancy found in standard convolutional operations.
The core step of the SRU is the separate–reconstruct process, which aims to suppress spatial redundancy in the feature maps. The schematic diagram is depicted in Figure 8. The separation operation in the SRU primarily employs GN (Group Normalization) to evaluate the information content of the feature maps and obtain the feature $X_{out}$. $\mu$ and $\sigma$ represent the mean and standard deviation of the input feature maps, respectively, while $\varepsilon$ is a small positive constant added for stable division. $\gamma$ and $\beta$ denote trainable affine transformations.

$X_{out} = GN(X) = \gamma \frac{X - \mu}{\sqrt{\sigma^{2} + \varepsilon}} + \beta$ (6)
The channel weights $W_{\gamma}$ are obtained by normalizing the randomly initialized parameters $\gamma$. Here, $W_{\gamma}$ represents the set of weights assigned to each channel, where $C$ denotes the total number of channels.

$W_{\gamma} = \{w_{i}\} = \frac{\gamma_{i}}{\sum_{j=1}^{C} \gamma_{j}}, \quad i, j = 1, 2, \ldots, C$ (7)
The product of $W_{\gamma}$ and $X_{out}$ is mapped to the range (0, 1) through the Sigmoid function. A threshold gate is used to assign a weight of $W_{1}$ to high-information expression features with $W_{\gamma}$ values above the threshold and a weight of $W_{2}$ to irrelevant features with $W_{\gamma}$ values below the threshold.

$W = \mathrm{Gate}\left(\mathrm{Sigmoid}\left(W_{\gamma} X_{out}\right)\right)$ (8)

$W = \begin{cases} W_{1}, & W_{\gamma} > \mathrm{threshold} \\ W_{2}, & W_{\gamma} < \mathrm{threshold} \end{cases}$ (9)
The reconstruct operation combines the information-rich feature $X_{1}^{w}$ and the information-poor feature $X_{2}^{w}$ in a cross-reconstruction manner, enhancing the information flow between them. This process aims to generate more informative features while suppressing redundant ones, resulting in a spatially enriched feature map $X^{w}$.

$X_{1}^{w} = W_{1} \otimes X, \quad X_{2}^{w} = W_{2} \otimes X, \quad X_{11}^{w} \oplus X_{22}^{w} = X^{w1}, \quad X_{21}^{w} \oplus X_{12}^{w} = X^{w2}, \quad [X^{w1}, X^{w2}] = X^{w}$ (10)

“$\otimes$” represents element-wise multiplication, which is used to apply a weight matrix to the original feature map $X$. “$\oplus$” represents a fusion concatenation operation, which is used to merge feature vectors along the spatial dimension into a new feature vector. The use of square brackets signifies the concatenation operation, where two feature vectors are concatenated along the channel dimension.
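A compact PyTorch reading of the SRU described by Equations (6)–(10) is sketched below; the gate threshold, the group count, and the interpretation of the cross-reconstruction operator as element-wise addition of channel halves are assumptions of this sketch.

```python
# Sketch of the SRU separate-reconstruct step; threshold/groups are assumptions.
import torch
import torch.nn as nn

class SRU(nn.Module):
    def __init__(self, channels: int, groups: int = 4, threshold: float = 0.5):
        super().__init__()
        self.gn = nn.GroupNorm(groups, channels)
        self.threshold = threshold

    def forward(self, x):
        x_out = self.gn(x)                                     # Eq. (6)
        # channel weights W_gamma from the normalized GN scale parameters (Eq. 7)
        w_gamma = (self.gn.weight / self.gn.weight.sum()).view(1, -1, 1, 1)
        weights = torch.sigmoid(w_gamma * x_out)               # mapped to (0, 1), Eq. (8)
        w1 = (weights > self.threshold).float()                # informative positions
        w2 = (weights <= self.threshold).float()               # redundant positions, Eq. (9)
        x1, x2 = w1 * x, w2 * x                                # Eq. (10): weighted features
        # cross-reconstruction: exchange channel halves of the two streams, then concatenate
        x11, x12 = torch.chunk(x1, 2, dim=1)
        x21, x22 = torch.chunk(x2, 2, dim=1)
        return torch.cat([x11 + x22, x21 + x12], dim=1)        # spatially refined X^w

print(SRU(channels=64)(torch.randn(1, 64, 20, 20)).shape)      # torch.Size([1, 64, 20, 20])
```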
The core steps of the CRU involve segmentation, transformation, and fusion, aiming to reduce channel redundancy in the feature maps. The schematic diagram is illustrated in Figure 9.
The segmentation operation splits the feature map $X^{w}$ into two parts along the channel dimension, $\alpha C$ and $(1 - \alpha)C$. Furthermore, a 1 × 1 convolution is applied to compress the number of channels, reducing the computational costs. This process divides $X^{w}$ into a high-channel feature extractor, $X_{up}$, and a shallow-channel feature extractor, $X_{low}$.
In the transformation stage, $X_{up}$ utilizes an efficient convolution technique called GWC (Group-wise Convolution) combined with PWC (Point-wise Convolution) to replace the standard K × K convolution. This approach extracts representative feature information while reducing computational overhead. The outputs are then added and merged into a rich feature map, $Y_{1}$. $X_{low}$, on the other hand, employs a 1 × 1 PWC to generate complementary feature maps for $X_{up}$. The generated features are concatenated with reused features along the channel dimension, resulting in a shallow-level hidden feature map, $Y_{2}$.

$Y_{1} = M_{G} X_{up} + M_{P1} X_{up}$ (11)

$[M_{P2} X_{low}, X_{low}] = Y_{2}$ (12)
$M_{G}$, $M_{P1}$, and $M_{P2}$ represent learnable weight matrices, “+” denotes the pixel-wise addition operation, and the use of square brackets signifies the concatenation operation, where two feature vectors are concatenated along the channel dimension.
After the transformation, a pooling operation is applied to the input feature maps, $Y_{m}$, to obtain channel descriptors, $S_{m}$, which possess global spatial feature information. Here, $H$ represents the height of the spatial dimension of the feature map, while $W$ represents the width of the spatial dimension. $Y_{m}$ represents the feature maps before the pooling operation, and through pooling, $S_{m}$ is generated by aggregating information from spatial regions of $Y_{m}$. The descriptors, $S_{1}$ and $S_{2}$, from the upper and lower channels of $Y_{m}$, respectively, are stacked together. Channel soft attention is then performed on the stacked descriptors to generate important feature vectors, $\beta_{1}$ and $\beta_{2}$. Finally, the results of this attention mechanism, $Y_{1}$ and $Y_{2}$, are fused to produce a channel-rich feature map, $Y$.

$S_{m} = \mathrm{Pooling}(Y_{m}) = \frac{1}{H \times W} \sum_{i=1}^{H} \sum_{j=1}^{W} Y_{c}(i, j)$ (13)

$\beta_{1} = \frac{e^{S_{1}}}{e^{S_{1}} + e^{S_{2}}}, \quad \beta_{2} = \frac{e^{S_{2}}}{e^{S_{1}} + e^{S_{2}}}, \quad \beta_{1} + \beta_{2} = 1$ (14)

$Y = \beta_{1} Y_{1} + \beta_{2} Y_{2}$ (15)
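The split-transform-fuse logic of Equations (11)–(15) can be sketched as follows; the split ratio α, the group count, and the omission of the 1 × 1 channel-squeeze step are simplifying assumptions.

```python
# Sketch of the CRU split-transform-fuse step; alpha, groups, and the omitted
# channel-squeeze convolution are simplifications of the description above.
import torch
import torch.nn as nn
import torch.nn.functional as F

class CRU(nn.Module):
    def __init__(self, channels: int, alpha: float = 0.5, groups: int = 2):
        super().__init__()
        self.up_c = int(alpha * channels)                    # alpha*C "rich" channels
        self.low_c = channels - self.up_c                    # (1-alpha)*C "shallow" channels
        self.gwc = nn.Conv2d(self.up_c, channels, 3, padding=1, groups=groups)   # GWC
        self.pwc1 = nn.Conv2d(self.up_c, channels, 1, bias=False)                # PWC on X_up
        self.pwc2 = nn.Conv2d(self.low_c, channels - self.low_c, 1, bias=False)  # PWC on X_low

    def forward(self, x):
        x_up, x_low = torch.split(x, [self.up_c, self.low_c], dim=1)
        y1 = self.gwc(x_up) + self.pwc1(x_up)                # Eq. (11)
        y2 = torch.cat([self.pwc2(x_low), x_low], dim=1)     # Eq. (12): reuse X_low
        s1 = y1.mean(dim=(2, 3), keepdim=True)               # Eq. (13): global pooling
        s2 = y2.mean(dim=(2, 3), keepdim=True)
        beta = F.softmax(torch.stack([s1, s2]), dim=0)       # Eq. (14): channel soft attention
        return beta[0] * y1 + beta[1] * y2                   # Eq. (15): fused output Y

print(CRU(channels=64)(torch.randn(1, 64, 20, 20)).shape)    # torch.Size([1, 64, 20, 20])
```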

2.3.3. Efficient Multi-Scale Attention Module with Cross-Spatial Learning

To address the issue of increasing redundant information and the subsequent loss of important information and reduced recognition accuracy as the network depth increases, this paper further introduces EMA (Efficient Multi-Scale Attention Module with Cross-Spatial Learning) into the Neck layer. By incorporating the EMA module, the model’s recognition accuracy is improved while maintaining higher attention weights and enhancing model lightweighting.
The EMA attention mechanism consists of a 1 × 1 branch, a 3 × 3 branch, and a Cross-Spatial Learning module, as depicted in Figure 10. Firstly, to capture different feature semantics, the input feature map is divided into $G$ sub-features $X = [X_{0}, X_{1}, \ldots, X_{G-1}]$, $X_{i} \in \mathbb{R}^{C/G \times H \times W}$, along the channel direction. In the 1 × 1 branch, global adaptive average pooling is applied along the X and Y directions to encode the channels. The encoded images are then concatenated and shared through a 1 × 1 convolution. The output of the convolution is decomposed into two resolution vectors, which are fitted using the Sigmoid activation function. The fitted vectors undergo re-weight adaptive feature selection before being outputted by the 1 × 1 branch. In the 3 × 3 branch, a 3 × 3 convolution kernel is employed for multi-scale feature extraction, serving as the output. In the Cross-Spatial Learning module, the output of the 1 × 1 branch undergoes global average pooling, followed by global information encoding using the Softmax function. The result is then multiplied by the output of the 3 × 3 branch through element-wise multiplication, facilitating information interaction across channels. Simultaneously, the output of the 3 × 3 branch is globally encoded using the Softmax function and multiplied by the output of the 1 × 1 branch after Group Normalization, preserving precise spatial and positional features. The outputs of the two matrix multiplications are activated by the Sigmoid function and retain global information through re-weight adaptive feature variables, achieving pixel-level attention of different scales in the EMA attention mechanism for global contexts.
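Because the textual description is dense, a condensed PyTorch sketch of an EMA-style block is given below; it follows the grouped 1 × 1 and 3 × 3 branches and the cross-spatial matrix interaction outlined above, with the grouping factor chosen arbitrarily.

```python
# Condensed sketch of an EMA-style attention block; the grouping factor is an assumption.
import torch
import torch.nn as nn

class EMA(nn.Module):
    def __init__(self, channels: int, groups: int = 8):
        super().__init__()
        self.groups = groups
        c = channels // groups
        self.pool_h = nn.AdaptiveAvgPool2d((None, 1))   # average along the W direction
        self.pool_w = nn.AdaptiveAvgPool2d((1, None))   # average along the H direction
        self.agp = nn.AdaptiveAvgPool2d(1)              # global average pooling
        self.conv1x1 = nn.Conv2d(c, c, 1)
        self.conv3x3 = nn.Conv2d(c, c, 3, padding=1)
        self.gn = nn.GroupNorm(c, c)

    def forward(self, x):
        b, ch, h, w = x.shape
        g = x.reshape(b * self.groups, -1, h, w)                 # G sub-features
        # 1x1 branch: directional pooling, shared 1x1 conv, Sigmoid re-weighting
        x_h = self.pool_h(g)
        x_w = self.pool_w(g).permute(0, 1, 3, 2)
        hw = self.conv1x1(torch.cat([x_h, x_w], dim=2))
        x_h, x_w = torch.split(hw, [h, w], dim=2)
        x1 = self.gn(g * x_h.sigmoid() * x_w.permute(0, 1, 3, 2).sigmoid())
        x2 = self.conv3x3(g)                                     # 3x3 multi-scale branch
        # cross-spatial learning: Softmax-encoded global descriptors interact across branches
        a1 = torch.softmax(self.agp(x1).flatten(2).transpose(1, 2), dim=-1)
        a2 = torch.softmax(self.agp(x2).flatten(2).transpose(1, 2), dim=-1)
        y = torch.matmul(a1, x2.flatten(2)) + torch.matmul(a2, x1.flatten(2))
        weights = y.reshape(b * self.groups, 1, h, w).sigmoid()  # pixel-level attention
        return (g * weights).reshape(b, ch, h, w)

print(EMA(channels=64)(torch.randn(1, 64, 20, 20)).shape)       # torch.Size([1, 64, 20, 20])
```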

2.3.4. Experimental Setup and Evaluation Metrics for YOLOv8n-WSE-Pest Model Accuracy

To verify the accuracy of the YOLOv8n-WSE-pest model in pest identification, the hardware environment of this study is based on a 12th Gen Intel(R) Core(TM) i9-12900H 2.50 GHz processor, a 512 GB solid-state drive, and an NVIDIA GeForce RTX 3060 laptop GPU with 16 GB RAM. The software environment is built on the Windows 11 operating system, NVIDIA 528.24 driver, and CUDA version 11.3.1. The model development and testing were conducted using Python 3.7 and PyCharm 2017.
Extending beyond basic validation, the incorporation of a diverse set of evaluation metrics is pivotal for a thorough understanding of our enhanced model’s detection capabilities. This study employs a suite of metrics, namely precision, recall, F1 score, AP (average precision), mAP (mean average precision), and three types of loss functions, as benchmarks for gauging performance. Precision, as Equation (16) illustrates, evaluates the model’s ability to correctly classify truly positive instances amidst all positive predictions. A heightened precision score signifies the model’s refined ability to correctly discern specific pest classes.
$\mathrm{Precision} = \frac{TP}{TP + FP}$ (16)
Recall represents the ratio of correctly predicted positive samples to all actual positive samples. A higher recall indicates that the model can classify the majority of the classes, even though there may be occasional misclassifications. It is defined as Equation (17) shows.
$\mathrm{Recall} = \frac{TP}{TP + FN}$ (17)
Among these, TP represents the number of times the model predicts a positive sample as positive, FP represents the number of times it predicts a negative sample as positive, and FN represents the number of times it predicts a positive sample as negative. The F1 score is the harmonic mean of the precision and recall and is used to provide a more comprehensive reflection of the model’s performance. It is defined as Equation (18) shows.
$F1 = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$ (18)
mAP represents the AP across all classes. It is defined as Equation (19) shows.
$\mathrm{mAP} = \frac{\sum_{i=1}^{C} AP(i)}{C}$ (19)
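The following minimal helpers reproduce Equations (16)–(19); the TP/FP/FN counts in the example are illustrative, while the per-class AP values are taken from Table 4 of this paper.

```python
# Minimal helpers for Eqs. (16)-(19); example counts are illustrative.
def precision(tp: int, fp: int) -> float:
    return tp / (tp + fp)                         # Eq. (16)

def recall(tp: int, fn: int) -> float:
    return tp / (tp + fn)                         # Eq. (17)

def f1_score(p: float, r: float) -> float:
    return 2 * p * r / (p + r)                    # Eq. (18)

def mean_ap(ap_per_class: list[float]) -> float:
    return sum(ap_per_class) / len(ap_per_class)  # Eq. (19)

p, r = precision(95, 5), recall(95, 6)            # hypothetical TP/FP/FN counts
print(f"precision={p:.3f}, recall={r:.3f}, F1={f1_score(p, r):.3f}")
print(f"mAP={mean_ap([0.9725, 0.9811, 0.9782, 0.9862]):.4f}")  # per-class APs from Table 4
```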
In this study, a total of three loss functions were introduced to measure the object detection performance of the model, namely box_loss, cls_loss, and dfl_loss. box_loss is utilized to assess the IoU between the predicted bounding box and the true bounding box. The objective is to minimize the discrepancy between the predicted box and the true box by maximizing their IoU, as defined in Equation (20).
$L_{box} = 1 - IoU(\mathrm{pred\_box}, \mathrm{true\_box})$ (20)
cls_loss is employed to measure the accuracy of the model’s predictions regarding the target class by determining whether an object is present within a box. The definition is presented in Equation (21), where y is the true label (0 or 1), and p is the probability predicted as the positive class.
$L_{cls} = -\left[ y \log p + (1 - y)\log(1 - p) \right]$ (21)
dfl_loss corrects the model’s errors in predicting the bounding boxes of objects, as defined in Equation (22), where α t is the weight balancing positive and negative samples, γ is the focusing parameter that adjusts the weights of easily classified samples, and p t is the probability predicted by the model for the target class.
$L_{dfl} = -\alpha_{t} (1 - p_{t})^{\gamma} \log(p_{t})$ (22)
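A short PyTorch sketch of the three losses as written in Equations (20)–(22) follows; the α_t and γ values are illustrative, not the trained settings, and the inputs are toy tensors.

```python
# Sketch of the three losses in Eqs. (20)-(22); parameters and inputs are illustrative.
import torch

def box_loss(iou: torch.Tensor) -> torch.Tensor:
    return (1.0 - iou).mean()                                         # Eq. (20)

def cls_loss(p: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    return -(y * torch.log(p) + (1 - y) * torch.log(1 - p)).mean()    # Eq. (21)

def dfl_loss(p_t: torch.Tensor, alpha_t: float = 0.25, gamma: float = 2.0) -> torch.Tensor:
    return (-alpha_t * (1 - p_t) ** gamma * torch.log(p_t)).mean()    # Eq. (22)

iou = torch.tensor([0.83, 0.91, 0.78])                # IoU of predicted vs. true boxes
p = torch.tensor([0.92, 0.15, 0.88])                  # predicted positive-class probability
y = torch.tensor([1.0, 0.0, 1.0])                     # true labels
print(box_loss(iou).item(), cls_loss(p, y).item(), dfl_loss(p).item())
```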

3. Results

3.1. Analysis of Model Training Results

In this section, three loss functions, namely box_loss, cls_loss, and dfl_loss, are introduced to evaluate the accuracy of the model in predicting bounding boxes. These loss functions guide the training and optimization of the model, and lower values indicate a higher prediction accuracy. As shown in Figure 11, at the beginning of the training, both the training and validation loss functions decrease rapidly, but they exhibit oscillations before reaching 100 iterations. After 100 iterations, the loss function values stabilize. In particular, the YOLOv8n-WSE-pest model shows decreases of 7.46%, 22.79%, and 28.45% in box_loss, cls_loss, and dfl_loss values, respectively, compared to the original YOLOv8n network in the training set. In the validation set, the box_loss, cls_loss, and dfl_loss values decrease by 6.39%, 26.70%, and 36.15%, respectively, compared to the original network.
In this study, a comparison was conducted between the YOLOv8n-WSE-pest model and the original YOLOv8n model to assess the localization and prediction performances of the improved model. The comparison results, as shown in Figure 12, indicate that the improved model achieved a 3.16% increase in the precision, a 5.65% improvement in the recall, and a 4.43% improvement in the F1 score.

3.2. Experimental Analysis of Detection Model

3.2.1. Ablation Experiment

To further validate the accuracy of the proposed model in this study, several ablation experiments were conducted using different models, including YOLOv8n, YOLOv8n-W, YOLOv8n-S, YOLOv8n-E, YOLOv8n-WS, YOLOv8n-WE, YOLOv8n-SE, and YOLOv8n-WSE-pest. As shown in Table 2, YOLOv8n-W is a model that improves the original YOLOv8n by incorporating the WIoU-v3 loss function. YOLOv8n-S incorporates the SCConv module into the Backbone network of YOLOv8n to optimize the spatial and channel feature extraction. YOLOv8n-E introduces the EMA module into the Neck network of YOLOv8n to enhance model attention.
As shown in Table 3, compared to the original network, the introduction of the WIoU-v3 loss function in YOLOv8n-W increased the precision by 0.40%, recall by 2.49%, and mAP50 by 0.79%. The addition of the SCConv module improved the precision by 1.02%, recall by 2.77%, and mAP50 by 1.01%. The inclusion of the EMA attention mechanism increased the precision by 1.69%, recall by 6.63%, and mAP50 by 2.11%. This module had the most significant impact on precision improvement but also increased the parameters, gradients, and GFLOPs by 8,011,984, 8,011,984, and 20.2, respectively. When comparing YOLOv8n-SE with YOLOv8n-E, the former showed similar precision, recall, and mAP50 but significantly reduced parameters, gradients, and GFLOPs by 8,037,332, 8,037,332, and 20.7, respectively. EMA enhanced the model’s localization and prediction performance, but it also increased the model’s parameter count and complexity. On top of that, incorporating the SCConv module significantly reduced the parameter count during feature extraction while maintaining model accuracy.
Finally, compared to the YOLOv8n model, YOLOv8n-WSE-pest achieved a 3.12% improvement in the precision, a 5.65% improvement in the recall, and a 2.18% improvement in the mAP50.
To further enhance the interpretability of this deep learning model, Grad-CAM (Gradient-weighted Class Activation Mapping) is introduced in this study to provide visualizations of the prediction results. The principle of Grad-CAM involves performing global average pooling on the last convolutional layer of the network to calculate the importance of each feature map for a specific class and generate Class Activation Maps (CAM) [46]. Figure 13 shows the heatmaps for four different insect categories. The results indicate that YOLOv8n-WSE-pest produces darker colors for the predicted targets, suggesting that this model pays more attention to the regions of interest in the target images during prediction.
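For readers who wish to reproduce such heatmaps, a generic Grad-CAM sketch is given below; it assumes a classification-style forward pass that returns class scores, and the model, image, and target layer are placeholders rather than the exact YOLOv8n-WSE-pest configuration.

```python
# Generic Grad-CAM sketch (hooks on a chosen convolutional layer); model, image,
# and target_layer are placeholders, and the forward pass is assumed to return class scores.
import torch
import torch.nn.functional as F

def grad_cam(model, image, target_layer, class_idx):
    feats, grads = [], []
    fwd = target_layer.register_forward_hook(lambda m, i, o: feats.append(o))
    bwd = target_layer.register_full_backward_hook(lambda m, gi, go: grads.append(go[0]))
    score = model(image)[0, class_idx]                 # score of the class of interest
    model.zero_grad()
    score.backward()
    fwd.remove(); bwd.remove()
    weights = grads[0].mean(dim=(2, 3), keepdim=True)  # global average pooling of gradients
    cam = F.relu((weights * feats[0]).sum(dim=1, keepdim=True))   # weighted feature maps
    cam = F.interpolate(cam, size=image.shape[-2:], mode="bilinear", align_corners=False)
    return (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)     # normalized heatmap
```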

3.2.2. Comparative Model Experiment

To investigate the advantages of the improved model in detection, this study tested its performance under different image conditions. YOLOv8n-WSE-pest was compared against YOLOv8n, SSD, and Faster-RCNN. To ensure the reliability of the comparison results, the datasets used for the different models were consistent. The comparison results, as shown in Figure 14, were conducted on four insect categories: Toxoptera aurantii, Xyleborus fornicatus Eichhoff, Empoasca pirisuga Matsumura, and Malthodes discicollis Baudi di Selve. The testing scenarios included multiple objects, single objects, occluded objects, missing subjects, and interfering objects.
From the experimental results, it can be concluded that YOLOv8n-WSE-pest demonstrates good recognition accuracy under different lighting conditions, varying numbers of targets, and whether the subject is complete or not. The improved version of the model consistently outperforms other models in terms of accuracy across all conditions, while the original YOLOv8n model can recognize the subject in the majority of cases. Regarding the scenarios depicted in Figure 14a with small and multiple targets and occluded subjects, as well as Figure 14b with missing subjects, the improved model shows superior performance under these conditions, while SSD and Faster-RCNN fail to detect some of the targets. In Figure 14c, where the subject is interfered with by other insect species, and Figure 14d, where the image has high brightness, YOLOv8n-WSE-pest also exhibits a higher recognition accuracy compared to the other comparative models.
The AP values can be obtained by using recall and precision as the x-axis and y-axis, respectively, and calculating the area under the curve. The AP values and mAP values of the comparative models for the four insect categories are shown in Table 4. In terms of the Toxoptera aurantii category, this model achieved an AP improvement of 12.63%, 7.66%, and 3.18% compared to Faster-RCNN, SSD, and YOLOv8n, respectively. For Xyleborus fornicatus Eichhoff, the model showed AP improvements of 14.33%, 9.88%, and 2.58% compared to the three models, respectively. Similarly, for Empoasca pirisuga Matsumura, the improvements were 15.06%, 9.08%, and 1.7%, and for Malthodes discicollis Baudi di Selve, the improvements were 15.37%, 9.05%, and 1.26%. The model also outperformed the comparative models in terms of the mAP, with improvements of 14.34%, 8.85%, and 2.18%, respectively. The numerical visualization is shown in Figure 15, where YOLOv8n-WSE-pest achieved the highest AP values and mAP values among the four models.

4. Discussion

Compared to prior studies and conventional approaches, the proposed YOLOv8n-WSE-pest model demonstrates several significant advantages. By incorporating WIoU-v3 into the loss function, the model effectively addresses the limitations of the original IoU metric in describing boundary differences and detecting small objects. This improvement allows for a more precise description of the boundary differences between predicted and ground truth boxes, thereby enhancing the detection performance. Additionally, the introduction of SCConv helps in reducing feature redundancy, which, in turn, lowers the model’s complexity while maintaining high precision. The Efficient Multi-Scale Attention (EMA) module further enhances the model’s ability to recognize small, multiple, and interfered targets by utilizing multi-scale pixel-level attention with global contexts. Together, these enhancements delivered an outstanding performance during external validation. Notably, the precision improved by 3.12% compared to the original network while also reducing the parameter count by 25,348, thereby highlighting the model’s potential for deployment and application.
However, despite these advancements, the YOLOv8n-WSE-pest model still faces challenges, particularly with complex image datasets. Variations in environmental conditions, such as the lighting and weather, can impact the model’s detection accuracy. Additionally, the presence of interfering factors, such as vegetation and unrelated pests, complicates the detection process. The SCConv module, while effective in reducing feature redundancy, does not entirely eliminate these challenges. Similarly, the EMA module, though it improves the recognition capability, may still struggle with scenes of high complexity and clutter. These limitations highlight the need for further refinement and optimization of the model. To address these limitations, future research will focus on collecting more data on environmental variations and pest species, which will aid the model in generalizing better under diverse conditions. Additionally, further optimizing the model’s structure and exploring the practical applications of deep learning models in agricultural science and technology, while adding functionalities such as pest counting and tracking, will be essential for the model’s practical deployment.
In summary, the YOLOv8n-WSE-pest model significantly improves detection performance through WIoU-v3, SCConv, and EMA. Subsequent studies should concentrate on addressing the current limitations through the exploration of advanced data augmentation techniques and optimizing the model for real-time applications in diverse environments. These efforts will contribute to providing more robust and practical solutions for the intelligent and automated management of agricultural pest control.

5. Conclusions

This study aims to address the challenges of intelligent pest management in tea plantations in Yunnan Province, China. It focuses on resolving issues such as the insufficient detection accuracy, high model complexity, and poor regional adaptability of pest models in tea gardens. Accordingly, an efficient and accurate pest monitoring method is introduced. Built on the original YOLOv8n model, it incorporates three additional modules, namely the WIoU-v3 loss function, SCConv for feature redundancy reduction, and the EMA attention mechanism, to meet the needs of intelligent intervention for pest control in tea plantations.
Experimental results indicate that the model’s precision, recall, and mAP50 reached 95.08%, 94.19%, and 97.95%, respectively. Compared to the original model, the improved version achieved these gains in the primary precision metrics while also reducing the model’s parameter count and size, demonstrating its streamlined architecture. Comparisons with the original YOLOv8n and other classic models show that the proposed model achieves marked enhancements over the original YOLOv8 architecture, evidenced by a 3.12% rise in precision, a 5.65% increase in recall, a 2.18% boost in the mean average precision (mAP), and a 4.43% gain in the F1 score. Furthermore, in comprehensive evaluations against Faster R-CNN, SSD, and YOLOv8n, our model exhibits substantial mAP improvements of 14.34%, 8.85%, and 2.18%, respectively.
To conclude, YOLOv8n-WSE-pest demonstrates significant advantages over other classic object detection models in terms of both its lightweight design and high accuracy, making it valuable for future tea garden and agricultural pest management. Its lightweight yet precise design meets the demands of real-time precision, making it highly suitable for deployment on small-scale devices. In subsequent work, we will combine more methods and IoT technologies to directly deploy micro cameras in the tea garden to capture more dynamic images of insects. Subsequent research endeavors will concentrate on acquiring broader datasets and integrating atmospheric environmental parameters such as humidity, light intensity, and wind speed, aiming to achieve a multimodal fusion of image and environmental information. This integration aims to enhance the sophistication of intelligent pest monitoring and management practices, thus advancing the field of applied science and smart agricultural technologies.

Author Contributions

Conceptualization and writing—original draft preparation, H.L.; methodology, W.Y.; software implementation and support, Y.X.; data validation and quality assurance, Z.W.; data curation and management, J.H.; writing—review, editing, and refinement, Q.W. and S.Z.; formal analysis and interpretation of results, L.L. and F.Y.; funding acquisition, B.W. All authors have read and agreed to the published version of the manuscript.

Funding

This study received funding from various sources, such as the Innovative Team for AI and Big Data Applications in Yunnan’s Tea Industry (202405AS350025), the grants for the Development and Demonstration of Intelligent Agriculture Data Sensing Technology and Equipment in Plateau Mountainous Areas (202302AE09002001), the Study of Yunnan Big Leaf Tea Tree Phenotypic Plasticity Characteristics Selection Mechanism Based on AI-driven Data Fusion (202301AS070083), the Smart Tea Industry Technology Task of Menghai County, Yunnan Province (202304BI090013), and the Yunnan Province Lancang County Xuelinsi Wa Ethnic Township, Nuofu Township Science and Technology Special Dispatch Team (202204BI090079).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data presented in this study are available on request from the corresponding author.

Acknowledgments

The authors express profound gratitude to the editor and reviewers for their insightful feedback and recommendations, which significantly enhanced the quality of this paper.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Xia, Y.; Wang, Z.; Cao, Z.; Chen, Y.; Li, L.; Chen, L.; Zhang, S.; Wang, C.; Li, H.; Wang, B. Recognition Model for Tea Grading and Counting Based on the Improved YOLOv8n. Agronomy 2024, 14, 1251. [Google Scholar] [CrossRef]
  2. Wang, Z.; Yang, C.; Che, R.; Li, H.; Chen, Y.; Chen, L.; Yuan, W.; Yang, F.; Tian, J.; Wang, B. Assisted Tea Leaf Picking: The Design and Simulation of a 6-DOF Stewart Parallel Lifting Platform. Agronomy 2024, 14, 844. [Google Scholar] [CrossRef]
  3. Jiang, C.; Zeng, Z.; Huang, Y.; Zhang, X. Chemical compositions of Pu’er tea fermented by Eurotium cristatum and their lipid-lowering activity. LWT 2018, 98, 204–211. [Google Scholar] [CrossRef]
  4. Chen, N.; Zhu, J.; Zheng, L. Light-YOLO: A Study of a Lightweight YOLOv8n-Based Method for Underwater Fishing Net Detection. Appl. Sci. 2024, 14, 6461. [Google Scholar] [CrossRef]
  5. Zhang, Z.; Zhou, C.; Xu, Y.; Huang, X.; Zhang, L.; Mu, W. Effects of intercropping tea with aromatic plants on population dynamics of arthropods in Chinese tea plantations. J. Pest Sci. 2017, 90, 227–237. [Google Scholar] [CrossRef]
  6. Somnath, R.; Gautam, H.; Narayanannair, M.; Kavya, D.; Mukhopadhyay, R.S.; Ananda, M.; Azariah, B. Use of plant extracts for tea pest management in India. Appl. Microbiol. Biot. 2016, 100, 4831–4844. [Google Scholar]
  7. Liao, Y.; Yu, Z.; Liu, X.; Zeng, L.; Cheng, S.; Li, J.; Tang, J.; Yang, Z. Effect of Major Tea Insect Attack on Formation of Quality-Related Nonvolatile Specialized Metabolites in Tea (Camellia sinensis) Leaves. J. Agr. Food Chem. 2019, 67, 6716–6724. [Google Scholar] [CrossRef]
  8. Chen, X.; Hassan, M.M.; Yu, J.; Zhu, A.; Han, Z.; He, P.; Chen, Q.; Li, H.; Ouyang, Q. Time series prediction of insect pests in tea gardens. J. Sci. Food Agric. 2024, 104, 5614–5624. [Google Scholar] [CrossRef]
  9. Blackie, H.M.; MacKay, J.W.; Allen, W.J.; Smith, D.H.V.; Barrett, B.; Whyte, B.I.; Murphy, E.C.; Ross, J.; Shapiro, L.; Ogilvie, S.; et al. Innovative developments for long-term mammalian pest control. Pest Manag. Sci. 2014, 70, 345–351. [Google Scholar] [CrossRef]
  10. Schellhorn, N.A.; Parry, H.R.; Macfadyen, S.; Wang, Y.; Zalucki, M.P. Connecting scales: Achieving in-field pest control from areawide and landscape ecology studies. Insect Sci. 2015, 22, 35–51. [Google Scholar] [CrossRef]
  11. Baruah, P.M.; Bardaloi, S.; Bordoloi, S. A comparative survey of the pest prevalence and chemical control practices in the tea gardens of Sonitpur district of Assam. Int. J. Phys. Soc. Sci. 2015, 5, 22–32. [Google Scholar]
  12. He, J.; Zhang, S.; Yang, C.; Wang, H.; Gao, J.; Huang, W.; Wang, Q.; Wang, X.; Yuan, W.; Wu, Y.; et al. Pest recognition in microstates state: An improvement of YOLOv7 based on Spatial and Channel Reconstruction Convolution for feature redundancy and vision transformer with Bi-Level Routing Attention. Front. Plant Sci. 2024, 15, 1327237. [Google Scholar] [CrossRef] [PubMed]
  13. Sun, L.; Cai, Z.; Liang, K.; Wang, Y.; Zeng, W.; Yan, X. An intelligent system for high-density small target pest identification and infestation level determination based on an improved YOLOv5 model. Expert Syst. Appl. 2024, 239, 122190. [Google Scholar] [CrossRef]
  14. Li, H.; Shi, H.; Du, A.; Mao, Y.; Fan, K.; Wang, Y.; Shen, Y.; Wang, S.; Xu, X.; Tian, L.; et al. Symptom recognition of disease and insect damage based on Mask R-CNN, wavelet transform, and F-RNet. Front. Plant Sci. 2022, 13, 922797. [Google Scholar] [CrossRef] [PubMed]
  15. Magsi, F.H.; Cai, X.; Luo, Z.; Li, Z.; Bian, L.; Xiu, C.; Fu, N.; Li, J.; Hall, D.R.; Chen, Z. Identification, synthesis, and field evaluation of components of the female-produced sex pheromone of Helopeltis cinchonae (Hemiptera: Miridae), an emerging pest of tea. Pest Manag. Sci. 2024. [Google Scholar] [CrossRef] [PubMed]
  16. Jiang-Lin, Q.; Xiu-Hao, Y.; Zhong-Wu, Y.; Ji-Tong, L.; Xiu-Feng, L. New technology for using meteorological information in forest insect pest forecast and warning systems. Pest Manag. Sci. 2017, 73, 2509–2518. [Google Scholar]
  17. Banik, A.; Chattopadhyay, A.; Ganguly, S.; Mukhopadhyay, S.K. Characterization of a tea pest specific Bacillus thuringiensis and identification of its toxin by MALDI-TOF mass spectrometry. Ind. Crops Prod. 2019, 137, 549–556. [Google Scholar] [CrossRef]
  18. Li, J.; Zhou, Y.; Zhou, B.; Tang, H.; Chen, Y.; Qiao, X.; Tang, J. Habitat management as a safe and effective approach for improving yield and quality of tea (Camellia sinensis) leaves. Sci. Rep. 2019, 9, 433. [Google Scholar] [CrossRef]
  19. Ju, C.C.; Yu, H.Y.; Shuo, L.Y.; Yu, C.C.; Min, H.Y. An AIoT Based Smart Agricultural System for Pests Detection. IEEE Access 2020, 8, 180750–180761. [Google Scholar]
  20. Gao, D.; Sun, Q.; Hu, B.; Zhang, S. A Framework for Agricultural Pest and Disease Monitoring Based on Internet-of-Things and Unmanned Aerial Vehicles. Sensor 2020, 20, 1487. [Google Scholar] [CrossRef]
  21. Ali, M.A.; Sharma, A.K.; Dhanaraj, R.K. Heterogeneous features and deep learning networks fusion-based pest detection, prevention and controlling system using IoT and pest sound analytics in a vast agriculture system. Comput. Electr. Eng. 2024, 116, 109146. [Google Scholar] [CrossRef]
  22. Brunelli, D.; Albanese, A.; d’Acunto, D.; Nardello, M. Energy Neutral Machine Learning Based IoT Device for Pest Detection in Precision Agriculture. IEEE Internet Things Mag. 2019, 2, 10–13. [Google Scholar] [CrossRef]
  23. Kiobia, D.O.; Mwitta, C.J.; Fue, K.G.; Schmidt, J.M.; Riley, D.G.; Rains, G.C. A Review of Successes and Impeding Challenges of IoT-Based Insect Pest Detection Systems for Estimating Agroecosystem Health and Productivity of Cotton. Sensor 2023, 23, 4127. [Google Scholar] [CrossRef] [PubMed]
  24. Wang, Y.; Xu, R.; Bai, D.; Lin, H. Integrated Learning-Based Pest and Disease Detection Method for Tea Leaves. Forests 2023, 14, 1012. [Google Scholar] [CrossRef]
  25. Xiaohu, Z.; Jingcheng, Z.; Yanbo, H.; Yangyang, T.; Lin, Y. Detection and discrimination of disease and insect stress of tea plants using hyperspectral imaging combined with wavelet analysis. Comput. Electron. Agric. 2022, 193, 106717. [Google Scholar]
  26. Qingwen, G.; Chuntao, W.; Deqin, X.; Qiong, H. Automatic monitoring of flying vegetable insect pests using an RGB camera and YOLO-SIP detector. Precis. Agric. 2022, 24, 436–457. [Google Scholar]
  27. Cheng, X.; Zhang, Y.; Chen, Y.; Wu, Y.; Yue, Y. Pest identification via deep residual learning in complex background. Comput. Electron. Agric. 2017, 141, 351–356. [Google Scholar] [CrossRef]
  28. Yunong, T.; Shihui, W.; En, L.; Guodong, Y.; Zize, L.; Min, T. MD-YOLO: Multi-scale Dense YOLO for small target pest detection. Comput. Electron. Agric. 2023, 213, 108233. [Google Scholar]
  29. Dong, S.; Teng, Y.; Jiao, L.; Du, J.; Liu, K.; Wang, R. ESA-Net: An efficient scale-aware network for small crop pest detection. Expert Syst. Appl. 2024, 236, 121308. [Google Scholar] [CrossRef]
  30. Tang, Z.; Chen, Z.; Qi, F.; Zhang, L.; Chen, S. Pest-YOLO: Deep Image Mining and Multi-Feature Fusion for Real-Time Agriculture Pest Detection; IEEE: New York, NY, USA, 2021; pp. 1348–1353. [Google Scholar]
  31. Chen, Z.; Zhou, H.; Lin, H.; Bai, D. TeaViTNet: Tea Disease and Pest Detection Model Based on Fused Multiscale Attention. Agronomy 2024, 14, 633. [Google Scholar] [CrossRef]
  32. Shihao, Z.; Hekai, Y.; Chunhua, Y.; Wenxia, Y.; Xinghui, L.; Xinghua, W.; Yinsong, Z.; Xiaobo, C.; Yubo, S.; Xiujuan, D.; et al. Edge Device Detection of Tea Leaves with One Bud and Two Leaves Based on ShuffleNetv2-YOLOv5-Lite-E. Agronomy 2023, 13, 577. [Google Scholar] [CrossRef]
  33. Chataut, R.; Phoummalayvane, A.; Akl, R. Unleashing the Power of IoT: A Comprehensive Review of IoT Applications and Future Prospects in Healthcare, Agriculture, Smart Homes, Smart Cities, and Industry 4.0. Sensor 2023, 23, 7194. [Google Scholar] [CrossRef] [PubMed]
  34. Tan, S.-Y.; Hong, F.; Ye, C.; Wang, J.-J.; Wei, D. Functional characterization of four Hsp70 genes involved in high-temperature tolerance in Aphis aurantii (Hemiptera: Aphididae). Int. J. Biol. Macromol. 2022, 202, 141–149. [Google Scholar] [CrossRef] [PubMed]
  35. Laijin, L. Be on Guard against one of the Destructive Pest: Xyleborus fornicatus. Plant Prot. Technol. Ext. 1999, 19, 23. [Google Scholar]
  36. Guo, Z.M.; Cheng, F.Y.; Ma, M.J.; Tu, X.L. Advances in green control technology of Empoasca pirisuga Matumura in tea region of south Hubei province. Hubei Agric. Sci. 2019, 58, 9–12. [Google Scholar]
  37. Deka, B.; Babu, A. Tea Pest Management: A Microbiological Approach. Appl. Microbiol. Open Access 2021, 7, 206. [Google Scholar]
  38. Ivan, B.; Oscar, A.; Cristina, C.; Josep, P.; Stéphane, B.; Lorena, G.; Nuria, A. Development of a multi-primer metabarcoding approach to understanding trophic interactions in agroecosystems. Insect Sci. 2021, 29, 1195–1210. [Google Scholar]
  39. Li, X.Y.; Xie, M.; Dong, P.X.; Liang, X.C. Morphology of Pyrocoelia pygidialis Pic(Coleoptera: Lampyridae) with Notes on Its Biology. J. Insect Classif. 2008, 30, 300–308. [Google Scholar]
  40. Hussain, M. YOLO-v1 to YOLO-v8, the rise of YOLO and its complementary nature toward digital manufacturing and industrial defect detection. Machines 2023, 11, 677. [Google Scholar] [CrossRef]
  41. Luo, Z.; Wang, C.; Qi, Z.; Luo, C. LA_YOLOv8s: A lightweight-attention YOLOv8s for oil leakage detection in power transformers. Alex. Eng. J. 2024, 92, 82–91. [Google Scholar] [CrossRef]
  42. Chenghao, W.; Zhongqiang, L.; Ziyuan, Q. Transformer oil leakage detection with sampling-WIoU module. J. Supercomput. 2023, 80, 7349–7368. [Google Scholar]
  43. Yi-Fan, Z.; Weiqiang, R.; Zhang, Z.; Zhen, J.; Liang, W.; Tieniu, T. Focal and efficient IOU loss for accurate bounding box regression. Neurocomputing 2022, 506, 146–157. [Google Scholar]
  44. Tong, Z.; Chen, Y.; Xu, Z.; Yu, R. Wise-IoU: Bounding box regression loss with dynamic focusing mechanism. arXiv 2023, arXiv:2301.10051. [Google Scholar]
  45. Li, J.; Wen, Y.; He, L. SCConv: Spatial and Channel Reconstruction Convolution for Feature Redundancy. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 6153–6162. [Google Scholar]
  46. Hu, J.; Fan, T.; Tang, X.; Yang, Z.; Ren, Y. Nonlinear relations of urban morphology to thermal anomalies: A cross-time comparative study based on Grad-CAM and SHAP. Ecol. Indic. 2024, 162, 112024. [Google Scholar] [CrossRef]
Figure 1. Collection of original images of pests.
Figure 2. Partial original image samples.
Figure 3. Data augmentation.
Figure 4. Examples of low-quality images. Note: (a) illustrates severe overexposure and blur, which can obscure essential details; (b) presents an instance where the subject is unidentifiable, potentially due to occlusion or poor resolution; (c) shows a case of missing targets, which is critical, as it pertains to the absence of the object of interest within the frame; (d) displays reflective or mutilated images, which can lead to misinterpretation by the detection model.
Figure 5. Improved YOLOv8n network architecture. Note: Figure 5 illustrates the network architecture of YOLOv8n-WSE-pest in this study. The upper part depicts the hierarchical structure of the network, while the right side elucidates the working principle of the original components within the network. Specifically, the Binary Cross-Entropy (BCE) loss function is employed for binary classification tasks, quantifying the discrepancy between the predicted and actual class labels. The Distribution Focal Loss (DFL) transforms the bounding box regression problem in object detection into a sequence prediction problem, thereby enhancing the detection accuracy in scenarios where the targets exhibit boundary ambiguity or occlusion. Additionally, the shortcut operation within the Bottleneck component facilitates skip connections in feature maps, contributing to feature fusion and the stability of network training.
Figure 6. WIoU-v3 schematic diagram. Note: On the left side of the figure are anchor boxes of three different qualities. On the right side, the method is demonstrated to more accurately assess the quality of anchor boxes by predicting the relative position and size differences between the predicted boxes and the ground truth boxes.
Figure 7. Diagram of SCConv’s overall structure.
Figure 8. SRU schematic diagram.
Figure 9. CRU schematic diagram.
Figure 10. EMA algorithm principle diagram.
Figure 11. Model loss function: (ac) represent the trend graphs of box_loss, cls_loss, and dfl_loss for the YOLOv8n-WSE-pest model, respectively; (df) represent the trend graphs of box_loss, cls_loss, and dfl_loss for the original YOLOv8n model, respectively; (a,d) compare the box_loss between the validation set and the training set; (b,e) represent the cls_loss comparison between the validation set and the training set; (c,f) depict the dfl_loss comparison between the validation set and the training set.
Figure 12. Comparison between YOLOv8n-WSE-pest vs. original YOLOv8n. Note: This graphical representation showcases a comparative assessment of the foundational YOLOv8 model against a tailored variant, YOLOv8-WSE-pest, which is specifically adapted for pest recognition tasks. The depicted analysis spans three pivotal performance metrics—precision, recall, and F1 score—across a spectrum of confidence threshold values. Each distinct curve within the illustration corresponds to a separate pest category, while the overarching blue line signifies the aggregated average performance across all pest types considered. Progressing along the horizontal axis, which denotes the incremental confidence thresholds, the vertical axis records the respective scores of the outlined evaluative criteria. This meticulous comparison serves to highlight the augmented effectiveness of the YOLOv8-WSE-pest model in consistently identifying and classifying diverse pest species under a broad range of confidence levels, thereby affirming its advancement in specialized detection capabilities.
Figure 13. Grad-CAM results for pest identification.
Figure 14. Comparison of actual image detection.
Figure 15. AP recognized using different models in different categories.
Table 1. Dataset partitioning.
Pest Species | Training Dataset | Verification Dataset | Test Dataset
Toxoptera aurantii | 801 | 264 | 267
Xyleborus fornicatus Eichhoff | 795 | 269 | 271
Empoasca pirisuga Matsumura | 792 | 258 | 262
Malthodes discicollis Baudi di Selve | 806 | 273 | 265
Total | 3194 | 1064 | 1065
Table 2. An overview of the performance characteristics of our different YOLOv8n versions.
Algorithm | WIoU-v3 | SCConv | EMA | Characteristics
YOLOv8n | × | × | × | Baseline
YOLOv8n-W | √ | × | × | Precision focused
YOLOv8n-S | × | √ | × | Efficiency
YOLOv8n-E | × | × | √ | High performance
YOLOv8n-WS | √ | √ | × | Balanced
YOLOv8n-WE | √ | × | √ | Enhanced
YOLOv8n-SE | × | √ | √ | Simplified
YOLOv8n-WSE-pest | √ | √ | √ | Optimized
Note: The table presents the characteristics of various YOLOv8n algorithm versions, indicating the presence (√) or absence (×) of specific features. WIoU-v3 is a precision-focused module, SCConv is designed for efficiency, and EMA is associated with high performance. The “Baseline” version, YOLOv8n, lacks these modules. The suffixes “-W”, “-S”, “-E”, “-WS”, “-WE”, “-SE”, and “-WSE-pest” denote different combinations of these modules, with each version aiming to achieve a particular balance or optimization in performance. For instance, YOLOv8n-WSE-pest combines all three modules for the most optimized version.
Table 3. Results of ablation experiment.
Algorithm | Precision/% | Recall/% | mAP50/% | Layers | Parameters | Gradients | GFLOPs
YOLOv8n | 91.96 | 88.54 | 95.77 | 225 | 3,157,200 | 3,157,184 | 8.9
YOLOv8n-W | 92.36 | 91.03 | 96.56 | 225 | 3,157,200 | 3,157,184 | 8.9
YOLOv8n-S | 92.98 | 91.31 | 96.78 | 236 | 3,131,180 | 3,131,164 | 8.3
YOLOv8n-E | 93.65 | 95.17 | 97.88 | 233 | 11,169,184 | 11,169,168 | 29.1
YOLOv8n-WS | 92.67 | 91.93 | 96.47 | 236 | 3,131,180 | 3,131,164 | 8.3
YOLOv8n-WE | 94.14 | 96.41 | 97.16 | 233 | 11,169,184 | 11,169,168 | 29.1
YOLOv8n-SE | 94.29 | 95.75 | 97.42 | 244 | 3,131,852 | 3,131,836 | 8.4
YOLOv8n-WSE-pest | 95.08 | 94.19 | 97.95 | 244 | 3,131,852 | 3,131,836 | 8.4
Table 4. Comparing the AP values of various insect species in the model.
Model Name | AP of Toxoptera aurantii/% | AP of Xyleborus fornicatus Eichhoff/% | AP of Empoasca pirisuga Matsumura/% | AP of Malthodes discicollis Baudi di Selve/% | mAP/%
Faster-RCNN | 84.62 | 83.78 | 82.76 | 83.25 | 83.61
SSD | 89.84 | 88.23 | 88.74 | 89.57 | 89.10
YOLOv8n | 94.07 | 95.53 | 96.12 | 97.36 | 95.77
YOLOv8n-WSE-pest | 97.25 | 98.11 | 97.82 | 98.62 | 97.95

