1. Introduction
Tea tree is an important economic crop worldwide and is widely favored by consumers. In many countries, tea production serves as a pillar industry [1]. The environmental stability and biodiversity of tea plantations make tea trees vulnerable to various pests, which often affect the yield and quality of the tea, resulting in serious economic losses for tea farmers [2]. The control methods for different pests vary. Accurate and rigorous identification of tea tree pests is therefore crucial for implementing appropriate pest management strategies, ensuring the healthy growth of tea trees, and producing high-quality tea.
With the advancement of artificial intelligence technology, machine learning and deep learning have gradually replaced manual inspection as important approaches to crop pest detection and identification [3]. Early pest identification models were developed by combining image processing and machine learning techniques, enabling the extraction of features for pest detection [4]. Deng et al. [5] employed an image saliency technique to extract regions of interest from pest images and combined it with a Support Vector Machine (SVM) to identify tea tree pests, achieving a recognition rate of 85.5%. Lu et al. [6] proposed an innovative semi-automatic detection model by analyzing the morphological differences among various locust species; the model integrates image segmentation, feature extraction, and SVM classification to accurately identify both locust species and their developmental stages. Yang et al. [7] introduced a novel image processing method that combines ensemble learning with multi-feature fusion across two color spaces, enabling precise recognition and counting of greenhouse pests. Although these pest identification methods achieve high accuracy, they still face several challenges. Their performance relies heavily on manual feature extraction, which can result in the loss of critical pest details. Moreover, these methods struggle to adapt to diverse environments and a wide range of pest categories, highlighting the need for further research to improve model generalization and robustness.
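For context, the sketch below illustrates the classic handcrafted-feature pipeline that these early methods follow: compute a fixed descriptor (HOG here) from each pest image and train an SVM classifier on it. The descriptor choice, image size, and kernel settings are our illustrative assumptions, not the exact configurations of the cited studies.

```python
# Minimal sketch of a handcrafted-feature pest classifier: HOG descriptors
# from cropped pest images feed an SVM. All hyperparameters here are
# illustrative assumptions, not the settings of the cited works.
import cv2
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

def hog_feature(img_bgr, size=(128, 128)):
    """Resize a pest image crop and compute a single HOG descriptor."""
    gray = cv2.cvtColor(img_bgr, cv2.COLOR_BGR2GRAY)
    gray = cv2.resize(gray, size)
    # winSize == image size, so compute() yields one descriptor per image.
    hog = cv2.HOGDescriptor(size, (32, 32), (16, 16), (16, 16), 9)
    return hog.compute(gray).ravel()

def train_svm(images, labels):
    """images: list of BGR arrays; labels: integer class ids."""
    X = np.stack([hog_feature(im) for im in images])
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, labels, test_size=0.2, random_state=0)
    clf = SVC(kernel="rbf", C=10.0, gamma="scale").fit(X_tr, y_tr)
    print("held-out accuracy:", accuracy_score(y_te, clf.predict(X_te)))
    return clf
```

Because every step before the classifier is hand-designed, any pest detail the descriptor fails to encode is permanently lost, which is exactly the limitation noted above.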
The performance of a pest identification model relies on the features extracted from images. However, traditional models, which rely on manually extracted features, depend on subjective judgment and experience, making them unsuitable for real-world pest identification applications due to their limited performance and generalization capabilities. In recent years, deep learning and computer vision techniques have been widely applied in the field of image recognition [8,9,10,11,12,13,14,15]. Pest identification models developed using deep learning methods, which leverage convolutional mechanisms for feature extraction, have significantly improved both accuracy and robustness [16,17,18]. For instance, Liu et al. [19] used ensemble algorithms to integrate enhanced CNN models such as VGG16 and Inception-ResNet-v2, building a crop disease and pest recognition model with improved accuracy. Liu et al. [20] further enhanced the YOLOv4 model by integrating triple attention mechanisms and a focal loss function. While this model achieved a recognition accuracy of 95.2% on a self-built tomato pest dataset, it struggled to recognize small pests in complex backgrounds and dense plant scenes. This limitation was largely attributed to the dataset's simplistic background and insufficient consideration of pest size diversity.
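As background on the focal loss mentioned above: it down-weights well-classified examples so that training concentrates on hard ones, such as small or occluded pests. A minimal binary form is sketched below; the alpha and gamma values are the commonly used defaults, not necessarily those of the cited work.

```python
# Minimal focal loss sketch (binary/objectness form). The modulating
# factor (1 - p_t)^gamma suppresses the loss of easy examples; alpha
# balances positive and negative samples.
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, alpha=0.25, gamma=2.0):
    """logits, targets: tensors of the same shape; targets in {0, 1}."""
    bce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p = torch.sigmoid(logits)
    p_t = p * targets + (1 - p) * (1 - targets)          # prob of the true class
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)
    return (alpha_t * (1 - p_t) ** gamma * bce).mean()
```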
To address the challenge of fine-grained agricultural pest recognition in images, Li et al. [21] optimized the rotation invariance of CNN models through data augmentation strategies. This improvement addressed the poor recognition performance caused by multi-scale and variable pest postures in the field, resulting in high recognition accuracy for four types of rice pests in natural environments. Sun et al. [22] expanded the shallow feature range of the YOLOv5s model, effectively addressing the issue of small pests being missed or misidentified in high-density environments. Zhu et al. [23] proposed a method to refine multi-scale fusion features across different dimensions, enhancing feature expression and eliminating conflicting information between different pest characteristics; this approach significantly improved the accuracy of soybean pest detection in complex environments. Tang et al. [24] introduced the ECA attention mechanism and a transformer encoder, combined with a novel cross-stage feature fusion approach, to overcome the limitations of real-time detection for small-scale pests. Hu et al. [25] introduced a hybrid architecture of Transformer and multi-scale attention mechanisms into the YOLOX model, which significantly boosted the detection of small target pests. These studies have enhanced the models' ability to extract multi-scale features, effectively capturing critical characteristics of small pests and thus reducing the rate of missed and false detections.
Although significant advancements have been made in pest recognition algorithms based on deep neural networks, practical agricultural applications demand models that maintain high precision while being easy to deploy. Current research into lightweight pest recognition methods has primarily focused on two areas: reducing training costs to improve deployment efficiency and designing lightweight modules to minimize computational resource consumption. Gan et al. [26] leveraged transfer learning and attention mechanisms to improve the EfficientNet model, achieving efficient pest recognition with an accuracy of 69.5% while maintaining a sufficiently lightweight structure. Min and Wei [27] developed a high-precision, lightweight real-time detection model for Tephritidae pests, with a size of only 2.4 MB, by integrating innovative Multicat and C2flite modules into YOLOv8 and optimizing the number and size of the detection heads. Liang et al. [28] proposed a lightweight model, GBW-YOLOv5, which reduced the model size by 66.7% and successfully met the stringent real-time requirements for multi-scale cotton pest detection in complex field environments. These studies demonstrate that deep learning-based pest recognition methods offer substantial advantages over traditional methods that rely on manual feature extraction.
However, most datasets used in existing research are limited in background complexity, as they are typically captured in laboratory environments with relatively simple backgrounds. This constraint reduces a model's effectiveness when applied to complex real-world images. Moreover, accurately identifying pests presents additional challenges due to inter-species similarity, intra-species diversity, variable pest postures, and the complex backgrounds of real tea gardens, which include elements such as leaves, tree branches, and soil [29,30,31]. Together, these factors test the robustness and accuracy of image recognition algorithms. To address the recognition challenges and detection omissions caused by complex backgrounds, we propose a lightweight model called TTPRNet, designed to effectively capture pest details at various scales and to ensure precise, rapid pest identification in images with complex backgrounds.
Figure 1 shows the overall framework of the tea tree pest recognition model proposed in this paper. The main contributions of this paper are as follows: First, we propose a lightweight model, TTPRNet, capable of accurately and efficiently identifying multi-scale tea tree pests in complex environments, including varying light conditions and vegetation densities. Second, a novel network structure is designed by replacing the traditional ELAN structure in the CSPDarknet53 backbone of the YOLOv7-tiny model with a parallel network composed of ConvNeXt and ELAN. This parallel structure extends the model's receptive field and effectively prevents feature loss, thereby enhancing overall model performance. Third, the model performs well in detecting multi-scale pests and effectively identifies pests with high inter-specific similarity, such as those from the same family but different species. Fourth, the model not only improves the accuracy and speed of pest recognition but also integrates a pest counting feature, providing a basis for pest control decisions.
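To make the second contribution concrete, the sketch below shows one plausible way to run a ConvNeXt block in parallel with a simplified ELAN branch and fuse the two outputs. The branch depths, channel widths, and the concat-plus-1x1 fusion are our assumptions for illustration; the exact layout is defined by the architecture described in this paper.

```python
# Hedged sketch of a parallel ConvNeXt + ELAN unit. Branch widths and the
# concat-then-1x1 fusion are illustrative assumptions; the paper's
# architecture figure is authoritative.
import torch
import torch.nn as nn

class ConvNeXtBlock(nn.Module):
    """Standard ConvNeXt block: depthwise 7x7, LayerNorm, inverted MLP."""
    def __init__(self, c):
        super().__init__()
        self.dw = nn.Conv2d(c, c, 7, padding=3, groups=c)  # large receptive field
        self.norm = nn.LayerNorm(c)
        self.pw1 = nn.Linear(c, 4 * c)
        self.pw2 = nn.Linear(4 * c, c)
        self.act = nn.GELU()

    def forward(self, x):
        y = self.dw(x).permute(0, 2, 3, 1)                 # NCHW -> NHWC
        y = self.pw2(self.act(self.pw1(self.norm(y))))
        return x + y.permute(0, 3, 1, 2)                   # residual, back to NCHW

class SimpleELAN(nn.Module):
    """Reduced ELAN branch: stacked 3x3 convs whose intermediate outputs
    are concatenated and fused by a 1x1 conv."""
    def __init__(self, c):
        super().__init__()
        h = c // 2
        self.c1 = nn.Conv2d(c, h, 1)
        self.c2 = nn.Conv2d(h, h, 3, padding=1)
        self.c3 = nn.Conv2d(h, h, 3, padding=1)
        self.fuse = nn.Conv2d(3 * h, c, 1)
        self.act = nn.SiLU()

    def forward(self, x):
        a = self.act(self.c1(x))
        b = self.act(self.c2(a))
        d = self.act(self.c3(b))
        return self.act(self.fuse(torch.cat([a, b, d], dim=1)))

class ParallelConvNeXtELAN(nn.Module):
    """Run both branches on the same input; fuse by concat + 1x1 conv, so
    local (ELAN) and global (ConvNeXt) features are both preserved."""
    def __init__(self, c):
        super().__init__()
        self.convnext = ConvNeXtBlock(c)
        self.elan = SimpleELAN(c)
        self.merge = nn.Conv2d(2 * c, c, 1)

    def forward(self, x):
        return self.merge(torch.cat([self.convnext(x), self.elan(x)], dim=1))
```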
4. Discussion
Detecting tea tree pests against the complex backgrounds of natural environments is difficult, so accurately extracting pest features becomes critical. Some recent studies have focused on improving a model's multi-scale feature extraction capability to reduce background interference. Three recent models in the field of pest recognition were selected for comparison in this study. Qiang et al. [42] used a dual backbone and fused deep and shallow features to improve the recognition performance of the SSD model; although it achieved a mAP of 86.01% on a citrus pest dataset, the method exhibits recognition errors when facing similar pests. Zhao et al. [43] integrated the CBAM attention module into the YOLOv7 model to suppress distracting background information, allowing the model to focus more effectively on the pest region. In this study, this method achieved a mAP of 90.3%, a 1.5% improvement over the original model, but it still falls short of the CA attention module adopted in our study. Xu et al. [44] enhanced the model's ability to capture multi-scale pests by employing convolutional kernels of different sizes, coupled with the Inception module to extract features at various scales in parallel; their experiments on a rice pest dataset yielded a mAP of 91.4%, but real-time applications were not considered.
By continuously optimizing the model architecture, our study improves the model's anti-interference ability under complex backgrounds and achieves a balance between accuracy and detection speed. As shown in Table 12, the TTPRNet model achieves a mAP of 92.8%; the mAPs of the three models discussed above are 6.79, 4.6, and 1.4 percentage points lower, respectively. Additionally, the TTPRNet model shows a slight advantage in lightweight performance when comparing FPS and single-image detection time.
Additionally, we selected an image containing pests from the same genus but different species to compare the detection performance among the models. The recognition results are shown in Figure 14.
Figure 14 presents the detection results of the ten models for three different scarabs, highlighting the TTPRNet model's superior accuracy in detection and bounding box prediction compared to the other models. The EfficientDet model performs poorly, failing to recognize the target object. In contrast, the CenterNet, YOLOXs, and YOLOX-tiny models were able to identify the scarabs but still suffered missed detections. Further observation reveals that the SSD, YOLOv5s, YOLOv7-tiny, YOLOv7, and YOLOv8n models made misclassifications during recognition, incorrectly labeling the Miridiba sinensis on the right side of the image as Holotrichia parallela. The SSD model not only misclassified the target but also exhibited omissions. The YOLOv8n model encountered more serious issues, misclassifying both Miridiba sinensis and Anomala corpulenta Motschulsky as Holotrichia parallela. Notably, the YOLOv5m model exhibited a double error: it detected Miridiba sinensis but misclassified it as Holotrichia parallela, and it additionally misidentified a background plant area as Apolygus lucorum. The YOLOv8s model, on the other hand, successfully detected Miridiba sinensis but still experienced missed detections. Among all evaluated models, only TTPRNet correctly identified all targets without false or missed detections. Additionally, it displayed both the category and count of the pests in the upper left corner of the resulting figure, with one each of Anomala corpulenta Motschulsky, Holotrichia parallela, and Miridiba sinensis, demonstrating its accuracy and reliability in target detection.
While the TTPRNet model demonstrates significant performance advantages in tea tree pest recognition, there remains room for improvement, particularly in precision (P) relative to some other models. This limitation in P may be attributed to two factors: category imbalance in the dataset and the IoU threshold settings.
Firstly, our constructed dataset exhibits significant variability in sample numbers across categories, with some categories containing over 300 samples while others have only around 100. This imbalance poses challenges to the model's generalization ability. Secondly, while the dataset comprises images of various tea tree pests, the total number of images per pest category remains relatively limited; future efforts should focus on expanding the dataset with more diverse and extensive image samples. Finally, considering the diverse environmental conditions encountered in real-world applications, such as backlighting and adverse weather, future research should also include pest images captured under these non-ideal conditions. Integrating such images will enhance the robustness of the pest identification model, ensuring consistent pest recognition and categorization even under variable natural conditions, thereby improving the accuracy and reliability of pest monitoring and control.
Despite the limitations of the current study, the proposed model has demonstrated high recognition accuracy, fast detection speed, and low parameter requirements, indicating its potential for deployment on mobile devices. These features underscore the innovative and technologically advanced nature of the research and highlight its practical value in real-world applications, particularly in pest control for tea plantations. Rapid and accurate pest recognition on mobile devices can help tea farmers implement timely control measures, reducing the impact of pests on tea yield and quality and thereby promoting the sustainability of agricultural production.
5. Conclusions
In this study, we proposed a novel target recognition model named TTPRNet, designed to meet the need for accurate pest identification in complex tea garden environments. The model significantly enhances the ability to capture global information by incorporating the ConvNeXt architecture into the backbone, expanding the receptive field and improving performance in complex scenes. To further boost feature extraction and reduce background interference, the CA attention module was fused into the backbone's output feature layer, which markedly improved the model's recognition accuracy in complex tea garden scenes.
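For reference, the sketch below follows the published Coordinate Attention (CA) design (Hou et al., 2021), which the backbone output layer adopts: features are pooled separately along the height and width axes so the attention weights retain positional information. The reduction ratio r is our assumption.

```python
# Hedged sketch of the Coordinate Attention (CA) module, after
# Hou et al. (2021); the reduction ratio r is an illustrative assumption.
import torch
import torch.nn as nn

class CoordAtt(nn.Module):
    def __init__(self, c, r=16):
        super().__init__()
        m = max(8, c // r)
        self.conv1 = nn.Conv2d(c, m, 1)
        self.bn = nn.BatchNorm2d(m)
        self.act = nn.Hardswish()
        self.conv_h = nn.Conv2d(m, c, 1)
        self.conv_w = nn.Conv2d(m, c, 1)

    def forward(self, x):
        n, c, h, w = x.shape
        # Direction-aware pooling: one descriptor per row and per column.
        x_h = x.mean(dim=3, keepdim=True)                      # (n, c, h, 1)
        x_w = x.mean(dim=2, keepdim=True).permute(0, 1, 3, 2)  # (n, c, w, 1)
        y = self.act(self.bn(self.conv1(torch.cat([x_h, x_w], dim=2))))
        y_h, y_w = torch.split(y, [h, w], dim=2)
        a_h = torch.sigmoid(self.conv_h(y_h))                  # row weights
        a_w = torch.sigmoid(self.conv_w(y_w.permute(0, 1, 3, 2)))  # column weights
        return x * a_h * a_w           # position-aware channel reweighting
```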
Additionally, replacing the ordinary convolutions in the neck with GSConv effectively reduced redundant information and improved feature extraction efficiency. For bounding box regression, we employed a loss function that fuses CIoU with NWD at equal weight, which accelerated network convergence and improved localization accuracy. The experimental results demonstrate that the model achieved a mAP of 92.8% and 184.6 FPS in the pest detection tasks, significantly enhancing both recognition efficiency and accuracy compared with existing algorithms.
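A minimal sketch of the equal-weight CIoU and NWD fusion follows, assuming xyxy box coordinates, torchvision's complete_box_iou_loss for the CIoU term, and the normalization constant C = 12.8 from the original NWD paper; the exact implementation in this study may differ.

```python
# Hedged sketch of an equal-weight CIoU + NWD regression loss.
# Boxes are (x1, y1, x2, y2); C is dataset-dependent and assumed here.
import torch
from torchvision.ops import complete_box_iou_loss

def nwd(pred, target, C=12.8):
    """Normalized Wasserstein distance between boxes modeled as 2D
    Gaussians (Wang et al., 2021); returns a similarity in (0, 1]."""
    pc = (pred[:, :2] + pred[:, 2:]) / 2        # predicted centers
    tc = (target[:, :2] + target[:, 2:]) / 2    # target centers
    pwh = (pred[:, 2:] - pred[:, :2]) / 2       # half widths / heights
    twh = (target[:, 2:] - target[:, :2]) / 2
    w2 = ((pc - tc) ** 2).sum(1) + ((pwh - twh) ** 2).sum(1)
    return torch.exp(-torch.sqrt(w2) / C)

def box_loss(pred, target, beta=0.5):
    """Equal-weight fusion: beta * CIoU loss + (1 - beta) * NWD loss."""
    l_ciou = complete_box_iou_loss(pred, target, reduction="none")
    l_nwd = 1.0 - nwd(pred, target)
    return (beta * l_ciou + (1 - beta) * l_nwd).mean()
```

Because NWD stays smooth even when boxes barely overlap, blending it with CIoU is what makes the regression signal informative for very small pests.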
This study provides an efficient and accurate method for detecting pests in tea gardens, which is significant for developing scientific pest control strategies and promoting sustainable tea garden development. The proposed pest recognition model can effectively assist in the monitoring and control of tea pests, providing robust support for the success of tea plantations and contributing to the sustainable growth of the tea industry.