1. Introduction
As one of the key modern agricultural industries in Yunnan’s plateau region, the tea industry serves not only as a carrier of traditional culture but also as a crucial component of contemporary rural revitalization. In recent years, climate change and changes in cultivation methods have led to frequent and widespread pest outbreaks in tea plantations, severely constraining the development of the tea industry. Currently, pest management [1,2] in tea plantations relies primarily on chemical control, biological control, physical control, and agronomic management [3]. However, these traditional methods are strongly affected by the environment, slow to take effect, and costly to implement, requiring substantial manpower and material resources [4]. Therefore, effectively integrating intelligent technologies and data analysis to achieve precise pest control and resource optimization has become a critical issue for the sustainable development of the tea industry [5].
To realize intelligent and precise control of tea plantation pests [6], the first problem to be solved is the accurate detection and localization of the insects. With the rapid development of neural network technologies [7], pest detection in tea plantations has gradually become more intelligent, with deep neural networks and convolutional neural networks increasingly applied to pest image processing. Object detection technologies such as the YOLO (You Only Look Once) family and Faster R-CNN have provided efficient solutions for automatic pest localization.
Yue Yu et al. [8] developed the LP-YOLO(s) network based on YOLOv8 by replacing certain network modules with LP_Unit and LP_DownSample and integrating the Efficient Channel and Spatial Attention mechanism. This optimized the network structure and performance, reducing model parameters by 70.2% and improving model accuracy by 40.7%, with only a 0.8% drop in mAP (mean Average Precision).
For detecting Neotropical brown stink bugs, Bruno Pinheiro de Melo Lima et al. [9] proposed an improved YOLOv8 network, introducing P2 and C2f2 modules to optimize the original YOLOv8 structure and integrating the ByteTrack algorithm for automatic insect counting. Ablation experiments showed that, compared with the original YOLOv8, the YOLOv8n-P2-C2f2 network improved mAP0.5 and mAP0.95 by 9.6% and 4.4%, respectively, with only a 0.5 G increase in model parameters.
In addition, our research team proposed an improved YOLO algorithm for tea plantation pest detection based on YOLOv7 [10]. This enhancement introduced MPDIoU (Minimum Point Distance IoU) to optimize the loss function, employed spatial-channel reorganization convolution to improve the backbone network, and integrated a Vision Transformer with Bi-Level Routing Attention to further optimize the network structure. Experimental results indicated that the improved YOLOv7 network achieved Precision, Recall, F1-score, and mAP improvements of 5.68%, 5.14%, 5.41%, and 2.58%, respectively, while reducing model parameters by 1.39 G. However, the complex environment of tea plantations remains a significant challenge. Uneven lighting, leaf occlusion, diverse background textures, and small insect sizes blending into their surroundings frequently compromise detection accuracy, causing missed or false detections in complex pest scenarios.
Given that pest targets in tea plantations are typically small and often blend into complex backgrounds, conventional image acquisition devices face significant limitations in capturing target details and extracting features. To enhance data collection accuracy and image quality, this study employs microscopic lenses for data acquisition. The research subjects are Toxoptera aurantii (Boyer de Fonscolombe, 1841), Xyleborus fornicatus Eichhoff (Eichhoff, 1875), Arboridia apicalis (Nawa, 1917), and Empoasca pirisuga Matsumura (Matsumura, 1931). These insects damage tea tree leaves, tender buds, and branches through piercing and boring, leading to secondary diseases that severely affect the yield and quality of tea. Furthermore, these insects measure only 1.5 to 3 mm, making them difficult for existing insect detection algorithms to perceive effectively. Addressing current challenges in pest detection research, this study proposes a deep learning model, I-YOLOv10-SC, based on Space-to-Depth Convolution, the Convolutional Block Attention Module, Shape Weights, and Scale Adjustment Factors. Its core innovation lies in achieving notable advances in detecting small and incomplete pest targets through multi-level feature aggregation and scale-adaptive optimization. Space-to-Depth Convolution [11] enhances the network’s capability to detect small insect targets while reducing detail loss during downsampling [12]. The Convolutional Block Attention Module improves feature representation and attention focus. Shape Weights and Scale Adjustment Factors optimize the loss function, boosting bounding box prediction accuracy, reducing false and missed detections for small targets, and accelerating model convergence.
To enhance the model’s interpretability, facilitate understanding among agricultural managers, and improve its trustworthiness and practicality in real-world applications, Grad-CAM visualization analysis [13] is further introduced to demonstrate the model’s detection process. Using the detections produced by this model, managers can monitor changes in pest species within tea plantations in real time and, by analyzing fluctuations in the detected insect counts, accurately track pest population dynamics, providing scientific evidence for pest monitoring. This study aims to provide an efficient and precise pest detection model for the development of intelligent tea plantations in Yunnan [14], offering technical support and implementation pathways for intelligent pest monitoring [15] and precise control, thus advancing the intelligent development of the tea industry.
2. Materials and Methods
2.1. Image Acquisition and Dataset Construction
To realistically simulate the tea plantation environment, all datasets used in this study were collected through field photography. These data were collected between May and October 2024. The primary collection sites were the Laobanzhang and Hekai bases in Menghai County (100° E, 21° N), Xishuangbanna Prefecture, Yunnan Province. External validation data were gathered from the tea plantation behind Yunnan Agricultural University (102° E, 25° N).
The captured pest images include two types of backgrounds: leaf backgrounds and yellow sticky trap backgrounds. A macro lens was used for image magnification, with a magnification factor of 200×, a focal length of 40.6 mm, a focusing distance of 0.3 mm, and a brightness range of 85–150 lumens. To enhance the model’s adaptability across different devices, image acquisition was conducted using various smartphones, including the iPhone 15 Pro Max, Huawei Nova 12, Redmi K60, and Vivo S19 (all manufactured in China).
A total of 3126 original samples were collected in this study: 1297 at the Laobanzhang base, 1365 at the Hekai base, and 464 in the Houshan tea garden of Yunnan Agricultural University, covering T. aurantii, X. fornicatus, A. apicalis, and E. pirisuga at different angles and against different backgrounds. After preliminary screening, 3000 images were selected to build the dataset, with annotations performed using Make Sense. A total of 6143 labels were generated from these images, as shown in Figure 1.
Panel A illustrates the histogram of the number of samples for each class. Panel B shows the width and height distributions of bounding boxes after aligning all label coordinates to the same position. Panel C displays the distribution of x and y coordinates within the images. Panel D represents the aspect ratio of label widths and heights, while Panel E provides detailed label distribution statistics from the original dataset. For external validation, 453 images from the tea plantation behind Yunnan Agricultural University were selected to assess the model’s generalization capability. The remaining 2547 images were randomly divided into a training set (2038 images) and a test set (509 images) using an 8:2 ratio.
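As a simple illustration of this split, the sketch below reproduces the 8:2 random division of the 2547 remaining images while keeping the 453 external-validation images separate; the file names are placeholders, not the actual dataset paths.

```python
import random

# Sketch of the 8:2 train/test split described above; the 453 external-validation
# images are kept apart and never mixed into training. File names are placeholders.
random.seed(0)
images = [f"img_{i:04d}.jpg" for i in range(2547)]    # remaining annotated images
random.shuffle(images)
split = round(0.8 * len(images))                      # 2038 training images
train_set, test_set = images[:split], images[split:]  # 2038 / 509 images
```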
2.2. Data Augmentation
To further improve the detection accuracy and generalization ability of the pest detection model, reduce overfitting, enhance adaptability to new environments, and better extract insect features across different environments and angles, this study introduces data augmentation to expand the training set [16]. As shown in Figure 2, to enhance the model’s adaptability to changes in insect position and orientation, geometric transformations such as rotation, scaling, cropping, and flipping were applied to the images, with rotation angles, scaling factors, and cropping ratios set randomly. To improve performance under varying brightness and color conditions, image brightness, contrast, saturation, and hue were adjusted randomly. To further reduce background-induced interference and enhance generalization across environments, background blurring was applied to the images. Additionally, to strengthen the model’s ability to detect occluded or partially damaged insects, random black occlusions were added to certain areas of the images, with the size and position of each occlusion set randomly.
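A minimal sketch of such an augmentation pipeline is given below, assuming the Albumentations 1.x API; the specific parameter values (rotation limit, occlusion size, and so on) are illustrative rather than the exact settings used in this study.

```python
import albumentations as A

# Sketch of the augmentation pipeline described above (Albumentations 1.x API assumed).
# Bounding boxes in YOLO format are transformed together with the image.
augment = A.Compose(
    [
        A.Rotate(limit=30, p=0.5),                              # random rotation
        A.RandomResizedCrop(height=608, width=608,
                            scale=(0.6, 1.0), p=0.5),           # random scaling and cropping
        A.HorizontalFlip(p=0.5),                                # flipping
        A.RandomBrightnessContrast(brightness_limit=0.3,
                                   contrast_limit=0.3, p=0.5),  # brightness/contrast jitter
        A.HueSaturationValue(hue_shift_limit=10, sat_shift_limit=30,
                             val_shift_limit=20, p=0.5),        # saturation/hue jitter
        A.GaussianBlur(blur_limit=(3, 7), p=0.3),               # background blurring
        A.CoarseDropout(max_holes=4, max_height=40, max_width=40,
                        fill_value=0, p=0.5),                   # random black occlusions
    ],
    bbox_params=A.BboxParams(format="yolo", label_fields=["class_labels"],
                             min_visibility=0.3),
)

# Usage: augmented = augment(image=image, bboxes=boxes, class_labels=labels)
```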
2.3. YOLOv10 Network Improvement
The YOLOv10 network structure primarily consists of three components: Backbone, Neck, and Head. The Backbone is responsible for extracting rich features from input images to generate high-quality feature maps. The Neck focuses on multi-scale feature fusion from the Backbone, while the Head generates the final detection results. Although YOLOv10 demonstrates excellent performance in multi-scale feature fusion and bounding box prediction, it still faces significant limitations when dealing with small targets, low-resolution images, and incomplete insects.
To address these issues, this study uses the YOLOv10 network as the base model. To enhance its capability to detect small insects and mitigate severe detail loss, the Backbone is structurally optimized using Space-to-Depth Convolution [17]. Because the original network’s reliance on global features limits its accuracy in detecting small and incomplete insects, the Convolutional Block Attention Module [18] is applied specifically to the small-object detection layer. Additionally, to further enhance bounding box prediction accuracy, reduce false and missed detections of small targets, and accelerate model convergence, Shape Weights and Scale Adjustment Factors are introduced to optimize the loss function. The improved YOLOv10 network structure is illustrated in Figure 3, with detailed parameters listed in Table 1. In Table 1, SPPF (Spatial Pyramid Pooling-Fast) performs pooling at multiple scales on feature maps to obtain multi-scale contextual information, thereby enhancing the model’s ability to perceive targets of various sizes.
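For reference, the sketch below shows the usual structure of an SPPF block (a 1 × 1 reduction, three chained max-pooling operations, and a 1 × 1 fusion); the exact channel widths used in YOLOv10 may differ and are assumptions here.

```python
import torch
import torch.nn as nn

class SPPF(nn.Module):
    """Spatial Pyramid Pooling-Fast sketch: three chained max-pools emulate
    pooling over growing receptive fields, and the results are concatenated."""
    def __init__(self, in_channels, out_channels, pool_size=5):
        super().__init__()
        hidden = in_channels // 2
        self.cv1 = nn.Conv2d(in_channels, hidden, kernel_size=1)
        self.cv2 = nn.Conv2d(hidden * 4, out_channels, kernel_size=1)
        self.pool = nn.MaxPool2d(kernel_size=pool_size, stride=1,
                                 padding=pool_size // 2)

    def forward(self, x):
        x = self.cv1(x)
        y1 = self.pool(x)    # pooling at a first scale
        y2 = self.pool(y1)   # equivalent to a larger pooling window
        y3 = self.pool(y2)   # equivalent to an even larger window
        return self.cv2(torch.cat([x, y1, y2, y3], dim=1))
```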
2.3.1. Space-to-Depth Convolution Optimization
In the pest detection task, although the traditional YOLOv10 network offers strong real-time performance and detection ability, it still has notable shortcomings when dealing with small targets, complex backgrounds, and low-resolution images. Analysis shows that these shortcomings stem mainly from the strided convolution and pooling operations [19] in its base architecture, which inevitably lose fine-grained information during downsampling and thus significantly reduce detection accuracy for small insects. When large numbers of insects are densely distributed, the detection ability of the YOLOv10 network is further degraded, readily producing missed and false detections.
To address the premature loss of small-target features caused by conventional downsampling methods, this study applies Space-to-Depth Convolution to structurally optimize the Backbone, ensuring complete feature transmission and effectively enhancing small-target detection accuracy.
As illustrated in Figure 4, Space-to-Depth Convolution primarily comprises two core components: SPD (Space-to-Depth) and Non-strided Convolution. The primary function of the SPD layer is to split the input feature map into multiple sub-feature maps according to a specified stride and to concatenate these sub-feature maps along the channel dimension, forming a new feature map. This operation effectively encodes spatial information into additional channel dimensions, reducing the spatial size while preserving all information. As shown in Equation (1), the input feature map $X$ has dimensions $S \times S \times C_1$, where $S \times S$ indicates the spatial resolution of the feature map, $C_1$ denotes the number of channels, and $f_{i,j}$ represents the sub-feature maps produced by the SPD splitting. During the splitting process, the SPD layer partitions the input feature map along rows and columns according to the specified stride. For example, when the stride is 2, four sub-feature maps of size $\frac{S}{2} \times \frac{S}{2} \times C_1$ are generated. After concatenation along the channel dimension, the new feature map has the size described in Equation (1):

$X' = \mathrm{Concat}\left(f_{0,0}, f_{1,0}, f_{0,1}, f_{1,1}\right), \quad X' \in \mathbb{R}^{\frac{S}{2} \times \frac{S}{2} \times 4C_1}$  (1)

After passing through the SPD layer, the number of channels in the feature map increases significantly. To further reduce the model’s computational cost while extracting discriminative features, a Non-strided Convolution layer is introduced after the SPD layer for channel dimension reduction. As shown in Equation (2), $X''$ represents the output feature map after channel reduction. To precisely retain discriminative features and enhance the model’s perceptual ability, the channel reduction employs a Non-strided (stride-1) Convolution with $C_2$ filters, ensuring efficient feature extraction while maintaining key information:

$X'' = \mathrm{Conv}_{\mathrm{stride}=1}\left(X'\right), \quad X'' \in \mathbb{R}^{\frac{S}{2} \times \frac{S}{2} \times C_2}$  (2)
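The PyTorch sketch below illustrates one way to realize the SPD splitting and the subsequent non-strided convolution described above; the module name, 3 × 3 kernel, BatchNorm, and SiLU activation are assumptions rather than the exact configuration of I-YOLOv10-SC.

```python
import torch
import torch.nn as nn

class SPDConv(nn.Module):
    """Space-to-Depth followed by a non-strided convolution (sketch)."""
    def __init__(self, in_channels, out_channels, scale=2):
        super().__init__()
        self.scale = scale
        # Non-strided convolution reduces the scale^2 * C1 concatenated channels to C2.
        self.conv = nn.Conv2d(in_channels * scale * scale, out_channels,
                              kernel_size=3, stride=1, padding=1, bias=False)
        self.bn = nn.BatchNorm2d(out_channels)
        self.act = nn.SiLU()

    def forward(self, x):
        s = self.scale
        # Split the S x S x C1 map into s*s sub-maps of size (S/s) x (S/s) x C1
        # and concatenate them along the channel dimension (Equation (1)).
        patches = [x[..., i::s, j::s] for i in range(s) for j in range(s)]
        x = torch.cat(patches, dim=1)              # (B, s*s*C1, S/s, S/s)
        # Channel reduction with a stride-1 convolution (Equation (2)).
        return self.act(self.bn(self.conv(x)))

# Example: a 64-channel 80x80 map becomes a 128-channel 40x40 map.
# y = SPDConv(64, 128)(torch.randn(1, 64, 80, 80))   # y.shape == (1, 128, 40, 40)
```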
2.3.2. Convolutional Block Attention Module Optimization
Research findings indicate that while the original YOLOv10 [20,21] network demonstrates strong object detection capabilities, it still faces notable limitations when detecting small targets and incomplete (partially occluded or fragmented) insects. These challenges arise because the downsampling operations in the YOLOv10 network excessively compress small-target features, reducing detection accuracy. Additionally, the network relies heavily on global features for target recognition.
To address these issues and enhance the YOLOv10 network’s detection accuracy for small targets and incomplete insects, as well as improve feature extraction capability and robustness, this study incorporates the Convolutional Block Attention Module into the network’s small-object detection layer.
As shown in Figure 5, CBAM (Convolutional Block Attention Module) [22] is a lightweight attention mechanism composed of the Channel Attention Module and the Spatial Attention Module. By adaptively adjusting the model’s attention distribution along the channel and spatial dimensions, it significantly enhances the model’s feature extraction ability.

In the CBAM, the CAM (Channel Attention Module) extracts both global and local information from the feature map along the channel dimension using global average pooling and max pooling operations. The resulting pooled features are then passed through an MLP (multi-layer perceptron) [23] for further processing, as shown in Equation (3), where F represents the input feature map and σ denotes the sigmoid activation function:

$M_c(F) = \sigma\left(\mathrm{MLP}(\mathrm{AvgPool}(F)) + \mathrm{MLP}(\mathrm{MaxPool}(F))\right)$  (3)

The channel attention map generated by the CAM is applied to adjust the feature response of each channel, effectively enhancing channels that carry critical pest-related features such as insect texture and color.
The SAM (Spatial Attention Module) generates two spatial feature maps of size $1 \times H \times W$ by performing max pooling and average pooling along the channel dimension. These two feature maps are then concatenated along the channel axis and processed through a convolutional layer to extract spatial attention features. As shown in Equation (4), $f^{7 \times 7}$ represents a 7 × 7 convolution operation:

$M_s(F) = \sigma\left(f^{7 \times 7}\left(\left[\mathrm{AvgPool}(F); \mathrm{MaxPool}(F)\right]\right)\right)$  (4)

This process allows the SAM to focus attention effectively on the regions where insects are located, thereby improving the localization accuracy of small and incomplete insects.
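A compact PyTorch sketch of the CBAM structure corresponding to Equations (3) and (4) is shown below; the channel reduction ratio of 16 follows the common CBAM default and is an assumption here.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Channel Attention Module (Equation (3))."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Conv2d(channels, channels // reduction, 1, bias=False),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1, bias=False),
        )

    def forward(self, x):
        avg = self.mlp(torch.mean(x, dim=(2, 3), keepdim=True))   # global average pooling
        mx = self.mlp(torch.amax(x, dim=(2, 3), keepdim=True))    # global max pooling
        return torch.sigmoid(avg + mx)

class SpatialAttention(nn.Module):
    """Spatial Attention Module (Equation (4))."""
    def __init__(self, kernel_size=7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2, bias=False)

    def forward(self, x):
        avg = torch.mean(x, dim=1, keepdim=True)                  # channel-wise average pooling
        mx, _ = torch.max(x, dim=1, keepdim=True)                 # channel-wise max pooling
        return torch.sigmoid(self.conv(torch.cat([avg, mx], dim=1)))

class CBAM(nn.Module):
    """CBAM: channel attention followed by spatial attention."""
    def __init__(self, channels, reduction=16, kernel_size=7):
        super().__init__()
        self.ca = ChannelAttention(channels, reduction)
        self.sa = SpatialAttention(kernel_size)

    def forward(self, x):
        x = x * self.ca(x)   # reweight channels
        x = x * self.sa(x)   # reweight spatial positions
        return x
```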
2.3.3. Loss Function Optimization
In the YOLOv10 network, the Bounding Box Regression Loss [24] primarily relies on the relative positions and shapes between predicted and ground-truth bounding boxes while often overlooking the geometric properties of the bounding boxes themselves. This limitation leads to reduced regression accuracy when dealing with insects with significant shape or size variations, ultimately affecting the model’s detection performance.
To address this issue, improve detection accuracy for small insects, enhance model robustness, and accelerate convergence, this study introduces Shape Weights and Scale Adjustment Factors to optimize the loss function of the YOLOv10 network. This method incorporates the geometric shape, aspect ratio, and scale of the bounding boxes into the training process so that the degree of match between the predicted box and the ground-truth box is described more accurately, thereby enhancing the model’s detection performance for targets with complex shapes. As shown in Equation (5), the Shape-Sensitive Distance [25,26] in the improved loss function is weighted by the aspect-ratio differences along the horizontal and vertical directions. Here, $x$ and $y$ represent the center coordinates of the predicted bounding box, while $x^{gt}$ and $y^{gt}$ denote the center coordinates of the ground-truth box. The diagonal length of the minimum enclosing box is represented by $c$, while $ww$ and $hh$ indicate the shape weights in the horizontal and vertical directions, respectively:

$\mathrm{distance}^{shape} = \frac{hh \cdot \left(x - x^{gt}\right)^{2}}{c^{2}} + \frac{ww \cdot \left(y - y^{gt}\right)^{2}}{c^{2}}$  (5)
For the calculation of the Shape Error Term, as shown in Equation (8), the shape error $\Omega^{shape}$ represents the cumulative shape error of the bounding box along the horizontal and vertical directions. The term θ is used to amplify the impact of the error, while $\omega_w$ and $\omega_h$ denote the shape error coefficients in the horizontal and vertical directions, respectively:

$\Omega^{shape} = \sum_{t \in \{w, h\}} \left(1 - e^{-\omega_t}\right)^{\theta}$  (8)

The optimized loss function is computed as described in Equation (9), where IoU represents the Intersection over Union between the predicted and ground-truth bounding boxes. This enhanced formulation incorporates both geometric shape and scale considerations, allowing the model to perform more accurate bounding box regression, especially for insects with varying shapes and sizes:

$L_{box} = 1 - \mathrm{IoU} + \mathrm{distance}^{shape} + 0.5 \cdot \Omega^{shape}$  (9)
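The PyTorch sketch below shows one way these terms combine into a shape-weighted bounding box loss. The definitions of the shape weights ($ww$, $hh$) and the shape error coefficients ($\omega_w$, $\omega_h$), as well as the default values of the scale adjustment factor and θ, follow the Shape-IoU formulation and are assumptions about the exact implementation used here.

```python
import torch

def shape_weighted_iou_loss(pred, target, scale=0.5, theta=4.0, eps=1e-7):
    """Shape-weighted IoU loss sketch. pred and target are (N, 4) boxes in (x1, y1, x2, y2)."""
    # Plain IoU
    inter_w = (torch.min(pred[:, 2], target[:, 2]) - torch.max(pred[:, 0], target[:, 0])).clamp(0)
    inter_h = (torch.min(pred[:, 3], target[:, 3]) - torch.max(pred[:, 1], target[:, 1])).clamp(0)
    inter = inter_w * inter_h
    w1, h1 = pred[:, 2] - pred[:, 0], pred[:, 3] - pred[:, 1]
    w2, h2 = target[:, 2] - target[:, 0], target[:, 3] - target[:, 1]
    iou = inter / (w1 * h1 + w2 * h2 - inter + eps)

    # Shape weights derived from the ground-truth box with a scale adjustment factor (assumed form)
    ww = 2 * w2.pow(scale) / (w2.pow(scale) + h2.pow(scale))
    hh = 2 * h2.pow(scale) / (w2.pow(scale) + h2.pow(scale))

    # Shape-sensitive center distance, normalised by the enclosing-box diagonal (Equation (5))
    cw = torch.max(pred[:, 2], target[:, 2]) - torch.min(pred[:, 0], target[:, 0])
    ch = torch.max(pred[:, 3], target[:, 3]) - torch.min(pred[:, 1], target[:, 1])
    c2 = cw ** 2 + ch ** 2 + eps
    dx = (pred[:, 0] + pred[:, 2] - target[:, 0] - target[:, 2]) / 2
    dy = (pred[:, 1] + pred[:, 3] - target[:, 1] - target[:, 3]) / 2
    dist_shape = hh * dx ** 2 / c2 + ww * dy ** 2 / c2

    # Shape error term (Equation (8)); omega_w, omega_h are the assumed error coefficients
    omega_w = hh * (w1 - w2).abs() / torch.max(w1, w2)
    omega_h = ww * (h1 - h2).abs() / torch.max(h1, h2)
    shape_err = (1 - torch.exp(-omega_w)).pow(theta) + (1 - torch.exp(-omega_h)).pow(theta)

    # Final loss (Equation (9))
    return 1 - iou + dist_shape + 0.5 * shape_err
```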
2.4. Model Evaluation Metrics and Training Configuration
To further evaluate the performance of the improved YOLOv10 network in tea plantation pest detection tasks, this study introduces Precision, Recall, F1-score (balanced score), and mAP as performance evaluation metrics. Precision represents the proportion of correctly detected pests among all targets predicted as a specific pest class by the model. Recall indicates the proportion of actual pest targets successfully detected by the model among all ground-truth pests. The F1-score is the harmonic mean of Precision and Recall, serving as a comprehensive measure of the model’s detection capability. As shown in Equations (12)–(16), TP represents the number of correctly detected pests, FP the number of incorrectly detected pests, and FN the number of pests missed by the model. Additionally, AP (Average Precision) [27] refers to the average precision of a given category under different IoU thresholds and is a comprehensive index of localization and prediction accuracy. The AP value is determined by the model’s Precision and Recall: it is the area under the class’s Precision–Recall curve over all predicted images (horizontal axis Recall, vertical axis Precision), and mAP is the mean of the APs of all classes. In Equation (15), $r_1, r_2, \ldots, r_n$ are the Recall values corresponding to the first interpolation points of the Precision interpolation segments, arranged in ascending order:

$\mathrm{Precision} = \frac{TP}{TP + FP}$  (12)

$\mathrm{Recall} = \frac{TP}{TP + FN}$  (13)

$F1 = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$  (14)

$AP = \sum_{i=1}^{n-1} \left(r_{i+1} - r_i\right) P_{\mathrm{interp}}\left(r_{i+1}\right)$  (15)

$mAP = \frac{1}{N} \sum_{k=1}^{N} AP_k$  (16)
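A minimal NumPy sketch of these computations is given below; the all-point interpolation of the Precision–Recall curve is one common convention and is an assumption about the exact interpolation scheme used here.

```python
import numpy as np

def precision_recall_f1(tp, fp, fn, eps=1e-16):
    """Precision, Recall and F1-score from detection counts (Equations (12)-(14))."""
    precision = tp / (tp + fp + eps)
    recall = tp / (tp + fn + eps)
    f1 = 2 * precision * recall / (precision + recall + eps)
    return precision, recall, f1

def average_precision(recalls, precisions):
    """Area under the interpolated Precision-Recall curve for one class (Equation (15))."""
    r = np.concatenate(([0.0], recalls, [1.0]))
    p = np.concatenate(([1.0], precisions, [0.0]))
    # Make precision monotonically non-increasing (right-to-left envelope).
    p = np.maximum.accumulate(p[::-1])[::-1]
    # Sum rectangle areas wherever recall increases.
    idx = np.where(r[1:] != r[:-1])[0]
    return float(np.sum((r[idx + 1] - r[idx]) * p[idx + 1]))

# mAP (Equation (16)) is the mean of the per-class AP values:
# mAP = np.mean([average_precision(r_c, p_c) for r_c, p_c in per_class_pr_curves])
```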
To evaluate the performance of the improved YOLOv10 network in tea plantation pest detection, this study conducted three sets of comparative experiments using four object detection networks: I-YOLOv10-SC (where “I” indicates IoU optimization, “S” represents Space-to-Depth Convolution optimization, and “C” stands for Convolutional Block Attention Module optimization), the original YOLOv10, Faster R-CNN, and SSD. Model training and testing were performed on the same dataset under identical hardware and software configurations to ensure scientific rigor and reliability of the test results.
The operating system used in this study was Windows 10, with model training conducted in GPU mode. The main machine configuration included a 12th Gen Intel(R) Core(TM) i5-12600KF 3.70 GHz processor, a 1 TB hard drive, and a Colorful NVIDIA GeForce RTX 4060 Ti Ultra W OC 16 G graphics card, running NVIDIA driver 561.09 with CUDA version 12.6. The hardware was supplied by Wuhan Qicaihong Company, Wuhan, Hubei Province, China. The development environment was Python 3.9 with PyCharm 2024. During training, the batch size was uniformly set to 64 and the number of epochs to 500.
To further improve model performance and stability, hyperparameter tuning strategies were applied to both the original YOLOv10 and the I-YOLOv10-SC network. During training, to effectively prevent overfitting or underfitting, the weight decay was set to 0.0005, controlling model complexity, suppressing unnecessary parameter growth, and enhancing generalization. To accelerate convergence and improve stability, the warm-up period was set to 3 epochs, ensuring that the model converges steadily at the start of training and avoiding performance oscillation caused by overly fast updates. In addition, to further smooth the convergence process, the warm-up initial momentum and the warm-up initial bias learning rate were set to 0.8 and 0.1, respectively.
Regarding the loss function design, the bounding box loss gain was set to 7.5 to strengthen bounding box regression and improve target localization accuracy [28]. The classification loss gain was set to 0.5 to balance target class recognition against the other loss terms. Finally, the keypoint object loss gain was set to 1.0, ensuring a reasonable contribution from the keypoint prediction task to the overall loss function.
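For reference, the training hyperparameters listed above can be consolidated as in the sketch below, which assumes an Ultralytics-style training interface; the model definition file and dataset YAML names are hypothetical placeholders rather than the authors’ actual files.

```python
# Hedged sketch: consolidating the hyperparameters described above.
from ultralytics import YOLO

model = YOLO("i-yolov10-sc.yaml")       # hypothetical improved-model definition file
model.train(
    data="tea_pests.yaml",              # hypothetical dataset configuration
    epochs=500,
    batch=64,
    weight_decay=0.0005,                # suppress unnecessary parameter growth
    warmup_epochs=3,                    # warm-up period for stable early convergence
    warmup_momentum=0.8,                # warm-up initial momentum
    warmup_bias_lr=0.1,                 # warm-up initial bias learning rate
    box=7.5,                            # bounding box loss gain
    cls=0.5,                            # classification loss gain
    kobj=1.0,                           # keypoint object loss gain
)
```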
4. Discussion
The I-YOLOv10-SC model proposed in this study is a deep learning model designed for the detection of small and incomplete insect targets in tea gardens, optimized to improve precise detection performance. The results indicate that, compared with the original YOLOv10 as well as networks such as Faster R-CNN and SSD, I-YOLOv10-SC shows significant improvements in both detection accuracy and model efficiency. Compared with the EigenCAM-based Hypertuned-YOLO developed by Stefano Frizzo Stefenon et al., its F1-score is 10.53% higher and its mAP 6.57% higher [32]; compared with the DenseNet-161-based YOLOu-Quasi-ProtoPNet network, its F1-score is 2.06% higher. These results demonstrate that I-YOLOv10-SC not only has stronger target detection ability but also outperforms most existing advanced models in terms of performance indicators [33].
One of the core innovations of this model is the introduction of Space-to-Depth convolution, which successfully addresses the common problem of feature loss during downsampling operations in traditional networks. This improvement allows I-YOLOv10-SC to better preserve the fine-grained features of small insect targets, enhancing its ability to detect pests that are small in size and set against complex backgrounds.
Additionally, the inclusion of the Convolutional Block Attention Module (CBAM) further enhances detection accuracy, especially when facing small or partially occluded insects. CBAM selectively focuses attention on the most critical features, helping the model concentrate on key pest-related attributes, thus improving its ability to handle partial occlusion and fragmented appearances of insects. By integrating Shape Weights and Scale Adjustment Factors into the loss function, the model’s detection performance is significantly improved. These adjustments allow the model to better match the predicted bounding boxes with the actual pest locations, reducing false positives and improving localization accuracy.
The I-YOLOv10-SC model has significant potential for intelligent pest monitoring in tea gardens. By precisely identifying and locating pests, the model can monitor the species and number of pests in real-time, providing accurate decision support to farmers and reducing the overuse of pesticides. Farmers can intervene locally in specific areas based on the actual detection results, not only reducing pesticide use but also minimizing pollution to soil and water sources. Furthermore, accurate pest monitoring helps protect the ecological environment and reduce harm to beneficial organisms, promoting the sustainability of agricultural production. This study lays a solid foundation for the advancement of precision pest control technologies and provides strong support for the development of smart agricultural technologies, contributing to the green transformation of agriculture and the achievement of sustainable development goals.
5. Conclusions
The I-YOLOv10-SC network was developed using a collaborative optimization method, greatly enhancing the accuracy and generalization of small-insect detection and providing a new solution to the challenge of detecting small objects in object detection tasks. By incorporating Space-to-Depth Convolution into the YOLOv10 backbone, the model reduces detail loss for distant targets and low-resolution images, improving its ability to detect small objects and accurately predict bounding boxes. The CBAM enhances the detection of small objects, helping the network better locate and identify small and incomplete insects. The network also introduces Shape Weights and Scale Adjustment Factors in the loss function, which improves the accuracy of bounding box predictions and speeds up model training. Experimental data reveal that the I-YOLOv10-SC network stabilizes after only 200 training epochs, approximately 100 epochs earlier than the original network. Its bounding box loss, classification loss, and keypoint loss stabilize below 0.65, 0.4, and 1.15, respectively, representing reductions of 18.75%, 27.27%, and 8%. The model also exhibits significantly reduced oscillation, indicating better training stability.
Ablation studies further validate the effectiveness of each proposed improvement. Space-to-Depth Convolution increases Recall by 1.53% and mAP by 1.31%, while CBAM boosts Precision by 0.2% and mAP by 1.11%. The loss function optimization strategy raises Recall by 1.03% and mAP by 0.91%. As a result, the overall I-YOLOv10-SC network outperforms the original YOLOv10 model with Precision, Recall, and mAP improvements of 5.88%, 6.67%, and 4.26%, respectively, with only a minimal parameter and gradient increase of less than 1 M.
Comparative experiments demonstrate that I-YOLOv10-SC surpasses the original YOLOv10 network, Faster R-CNN, and SSD in key metrics. Precision improved by 5.88%, 26.47%, and 14.11%, respectively, Recall by 6.67%, 19.80%, and 21.82%, F1-score by 6.27%, 23.27%, and 18.16%, and mAP by 4.26%, 19.71%, and 13.46%. The enhanced YOLOv10 network significantly strengthens robustness in small-object detection and adaptability to complex environments, reducing false positives and missed detections. These improvements provide effective technical support for intelligent pest monitoring and precision pest control in tea plantations [34], laying a solid foundation for future applications in smart agriculture [35,36].