Review

Overview of Pest Detection and Recognition Algorithms

1 National Key Laboratory of Human-Machine Hybrid Augmented Intelligence, National Engineering Research Center for Visual Information and Applications, and Institute of Artificial Intelligence and Robotics, Xi’an Jiaotong University, Xi’an 710049, China
2 School of Software Engineering, Xi’an Jiaotong University, Xi’an 710049, China
3 Research Center of Information Technology, Beijing Academy of Agriculture and Forestry Sciences, Beijing 100097, China
* Authors to whom correspondence should be addressed.
Electronics 2024, 13(15), 3008; https://doi.org/10.3390/electronics13153008
Submission received: 14 June 2024 / Revised: 19 July 2024 / Accepted: 29 July 2024 / Published: 30 July 2024
(This article belongs to the Topic Artificial Intelligence Models, Tools and Applications)

Abstract

Detecting and recognizing pests are paramount for ensuring the healthy growth of crops, maintaining ecological balance, and enhancing food production. With the advancement of artificial intelligence technologies, traditional pest detection and recognition algorithms based on manually selected pest features have gradually been substituted by deep learning-based algorithms. In this review paper, we first introduce the primary neural network architectures and evaluation metrics in the field of pest detection and pest recognition. Subsequently, we summarize widely used public datasets for pest detection and recognition. Following this, we present various pest detection and recognition algorithms proposed in recent years, providing detailed descriptions of each algorithm and their respective performance metrics. Finally, we outline the challenges that current deep learning-based pest detection and recognition algorithms encounter and propose future research directions for related algorithms.

1. Introduction

The issue of food scarcity has perennially posed a formidable challenge on a global scale. With the relentless expansion of the world’s populace, projected to approach 9.1 billion by the year 2050 [1], the imperative of securing food production assumes paramount significance worldwide. Pests, constituting a principal contributor to food wastage and a formidable threat to food security, possess the capacity to severely compromise crop yield and quality. Such repercussions can precipitate diminished food output, potentially engendering the misuse of pesticides, thereby exacerbating food safety concerns, precipitating fluctuations in food prices, and incurring economic losses. Consequently, the development of efficient technologies for pest detection and recognition assumes critical importance in preemptively mitigating and managing pest-induced damage, thus safeguarding food security, augmenting agricultural productivity, and preserving ecological equilibrium.
Traditional machine learning-based pest detection and recognition methods usually employ algorithms such as Support Vector Machines (SVM) [2,3,4,5,6,7,8,9], Decision Trees [8,10,11,12], Random Forests [13,14], K-Nearest Neighbors (KNN) [6,7,8,9,15,16,17], and Naive Bayes [6,7,12,17]. These algorithms often rely on manually designed feature extraction methods to detect and recognize pests, which require a significant amount of expertise and experience. However, manual feature extraction may prove insufficient in capturing the holistic features within pest images, consequently curtailing the precision and generalizability of detection and recognition algorithms. Moreover, since crop pest images are usually obtained in natural and real agricultural environments, the images often contain complex backgrounds and varying lighting conditions, which may affect the detection and recognition accuracy of traditional machine learning algorithms.
Deep learning algorithms, like Convolutional Neural Networks (CNNs) and Transformers, outperform traditional machine learning algorithms in efficiency and accuracy for pest detection and pest recognition tasks by automatically learning complex feature representations from vast amounts of data, enabling them to process high-dimensional data, improve generalization capabilities, and adapt to complex backgrounds and changing conditions. Currently, deep learning-based pest detection and recognition algorithms have been widely applied in the field of agricultural production. For instance, ref. [18] proposed an agricultural pest detection smartphone application, ref. [19] proposed a sticky paper pest counting trap that can be used anywhere in the field, and ref. [20] proposed a multi-agent vision system to support autonomous orchard spraying by detecting pests on fruit trees and analyzing their conditions.
The remaining sections of this paper are structured as follows. Section 2 introduces the methods and strategies we used to collect the literature. Section 3 introduces the neural network architectures commonly employed in the fields of pest detection and pest recognition, as well as the metrics used for evaluating algorithm performance. Section 4 presents several widely used public datasets for pest detection and recognition. In Section 5, various deep learning-based pest detection and recognition algorithms developed over the past five years are introduced, including their implementation details, and a comparative analysis of different algorithmic metrics is conducted. Section 6 outlines the challenges and problems encountered by deep learning-based pest detection and pest recognition algorithms and proposes future research directions for these algorithms.

2. Literature Collection Methods

In this review article, we adopted a systematic approach, as shown in Figure 1, to collect relevant literature on pest detection and pest recognition algorithms. Our method included the following steps:
1. Initial search: we conducted a comprehensive search in multiple academic databases, including Google Scholar, IEEE Xplore, and ArXiv, using keywords related to pest detection, pest recognition, and smart agriculture.
2. Manual search: to ensure the inclusion of pioneering works and the latest publications that may not yet be indexed in the databases, we performed manual searches by reviewing references in key articles and checking relevant conference proceedings and journals.
3. Screening process: The initial search yielded numerous articles. We screened these articles by reviewing titles and abstracts to exclude irrelevant studies. Full texts of the remaining articles were then reviewed to ensure they met our inclusion criteria: peer-reviewed journal articles and conference papers published between 2020 and 2024; articles had to be in English and provide substantial insights or advancements in the fields of pest detection or pest recognition using algorithms.
Figure 1. The workflow of collecting the related literature.
By following this systematic approach, we selected fifteen pest detection algorithms and fifteen pest recognition algorithms, aiming to ensure a comprehensive and up-to-date collection of the literature for our review.

3. Background

With the rapid advancement of deep learning technology, pest detection and recognition algorithms have achieved remarkable progress, playing a crucial role in enhancing crop protection efficiency and reducing the overuse of chemical pesticides. Current research on pest detection and recognition algorithms primarily focuses on two advanced architectures: the CNN architecture and the Transformer architecture. Each of these architectures offers unique advantages and is suited to specific application scenarios.
CNNs are renowned for their exceptional performance in image processing. CNNs effectively capture local features within images and progressively build more complex and abstract feature representations. This capability makes CNNs particularly effective in tasks involving the detection and recognition of pests. By stacking deep convolutional layers, CNNs can automatically learn the most efficient ways to extract useful information from pest images, thereby achieving high-performance detection and recognition of pests.
On the other hand, Transformer models, which are based on attention mechanisms, exhibit revolutionary advantages in handling sequential data. Transformer models excel in managing long-range dependencies, making them uniquely advantageous for processing time-series data or image sequences. In the realm of pest detection and recognition, Transformer models can analyze the temporal variations in pest behavior, providing robust support for dynamic monitoring and early warning of pest activities.

3.1. CNN Architecture

A CNN, as shown in Figure 2, is a type of deep learning model that is widely used in many areas of computer vision, such as image recognition [21,22,23,24,25,26,27,28,29,30], object detection [31,32,33,34,35,36,37,38,39,40], image segmentation [41,42,43,44,45,46,47,48,49,50], etc.
Generally, constructing a CNN requires four indispensable components: convolutional layers, activation functions, pooling layers, and fully connected layers. Among these four components, the convolutional layer serves as the fundamental element of a convolutional neural network. It extracts local features, such as edges, textures, and shapes, and generates feature maps by sliding a set of convolutional kernels across the input image and performing convolution operations.
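To make these four components concrete, the sketch below (assuming PyTorch, with illustrative layer sizes and a hypothetical TinyCNN name) stacks a convolutional layer, an activation function, a pooling layer, and a fully connected classifier; it is a minimal illustration rather than any specific pest model.
```python
# A minimal sketch (assuming PyTorch) of the four CNN building blocks named above:
# convolution, activation, pooling, and a fully connected classifier head.
import torch
import torch.nn as nn

class TinyCNN(nn.Module):
    def __init__(self, num_classes: int = 102):  # e.g., 102 classes as in IP102
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1),  # convolutional layer: local feature extraction
            nn.ReLU(),                                    # activation function: non-linearity
            nn.MaxPool2d(2),                              # pooling layer: spatial downsampling
            nn.Conv2d(16, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2),
        )
        self.classifier = nn.Linear(32 * 56 * 56, num_classes)  # fully connected layer (assumes 224x224 input)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.features(x)       # feature maps
        x = torch.flatten(x, 1)    # flatten to one vector per image
        return self.classifier(x)  # class logits

# Usage: logits = TinyCNN()(torch.randn(1, 3, 224, 224))  -> shape (1, 102)
```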
In recent years, with the rapid advancements in computer vision, numerous deep learning algorithms based on CNNs have been proposed. Currently, in tasks involving pest recognition and detection, the most commonly used deep learning algorithms are the Residual Network (ResNet) [51] for both pest detection and recognition, and You Only Look Once (YOLO) [34] for pest detection.

3.1.1. ResNet

In traditional theory, neural networks should exhibit enhanced expressive capabilities as the depth of the network layers increases. However, in practice, deep neural networks encounter significant challenges such as vanishing gradients and exploding gradients [52,53,54,55,56,57,58]. ResNet, proposed by Kaiming He, addresses these issues through the introduction of a residual learning framework for constructing deep networks.
Traditional deep neural networks attempt to learn the direct mapping from inputs to outputs, with connections between layers being sequential. Each layer receives input only from the previous layer and then outputs to the next layer. However, ResNet addresses this by introducing a residual learning framework, which enables it to learn the residual mapping between inputs and outputs. Specifically, this framework allows ResNet to learn the residual function instead of directly learning the desired underlying mapping.
Shortcut connections are the key mechanism enabling residual learning in ResNet. These connections allow the input signal to bypass one or multiple layers via an identity mapping and be directly passed to subsequent layers. The signal is then added to the output of those layers to form an F(x) + x structure, where F(x) represents the residual mapping learned by the stacked layers and x represents the input carried by the shortcut connection, as shown in Figure 3.
Shortcut connections directly link the shallow and deep layers of the ResNet, facilitating a more direct flow of gradients during the training process. This helps mitigate the vanishing gradient and exploding gradient problems faced by traditional deep neural networks. By alleviating these issues, ResNet can effectively train deeper networks than previously possible, significantly enhancing the model’s learning capacity and performance while making the training of deep networks more stable.
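A minimal sketch of this residual structure is given below (assuming PyTorch); the channel sizes are illustrative, and the block omits the projection shortcut used when dimensions change.
```python
# A minimal sketch (assuming PyTorch) of the F(x) + x residual structure described above.
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        # F(x): the residual mapping learned by two stacked convolutional layers
        self.residual = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
            nn.ReLU(),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
        )
        self.relu = nn.ReLU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Shortcut connection: the input x bypasses the stacked layers via an
        # identity mapping and is added to their output, giving F(x) + x.
        return self.relu(self.residual(x) + x)

# Usage: y = ResidualBlock(64)(torch.randn(1, 64, 56, 56))  # same shape as the input
```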

3.1.2. YOLO

Traditional CNN-based object detection algorithms, such as R-CNN [31], Fast R-CNN [32], and Faster R-CNN [33], offer high detection accuracy but are computationally intensive, making them unsuitable for real-time pest detection applications. In contrast, the YOLO framework addresses object detection as a regression problem, which is solved with a single forward pass through a convolutional neural network. YOLO simultaneously performs object localization and recognition across the entire image. By employing global feature representation, multi-scale prediction, and dividing the input image into a fixed grid, as shown in Figure 4, YOLO achieves efficient and effective real-time object detection.
Throughout its successive iterations, the YOLO series has incorporated numerous enhancements to improve its performance. Key improvements include the introduction of anchor mechanisms, batch normalization, and multi-scale prediction. The series has also integrated advanced feature extraction networks, such as CSPDarknet53 and Darknet-53, as well as sophisticated data augmentation techniques like Mosaic and CutMix. Additionally, YOLO has optimized its loss functions and training strategies, and utilized automated optimization tools and lightweight designs. These advancements have significantly increased YOLO’s detection accuracy, generalization capability, and inference efficiency, thereby ensuring its effectiveness and precision across a wide range of real-world applications.
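As an illustration of the grid-based regression formulation, the sketch below (assuming PyTorch and YOLOv1-style output conventions, with illustrative values of S, B, and C) shows how a single flat prediction vector is reshaped into per-cell box and class predictions.
```python
# A minimal sketch (assuming PyTorch and YOLOv1-style output conventions) of how a
# single forward pass yields grid-based predictions. S, B, and C are illustrative values.
import torch

S, B, C = 7, 2, 20                               # grid size, boxes per cell, number of classes
raw = torch.randn(1, S * S * (B * 5 + C))        # flat output of the detection head

pred = raw.reshape(1, S, S, B * 5 + C)           # one prediction vector per grid cell
boxes = pred[..., : B * 5].reshape(1, S, S, B, 5)  # (x, y, w, h, confidence) per box
class_probs = pred[..., B * 5:]                  # per-cell class probabilities

# Each cell is responsible for objects whose centre falls inside it; at inference,
# box confidences are combined with class probabilities and filtered with
# non-maximum suppression to obtain the final detections.
print(boxes.shape, class_probs.shape)  # torch.Size([1, 7, 7, 2, 5]) torch.Size([1, 7, 7, 20])
```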

3.2. Transformer Architecture

The Transformer [59] model is a deep learning architecture based on the attention mechanism, which was introduced to address issues of long-distance dependencies, limitations of parallel processing, and the balance between complexity and efficiency in sequence-to-sequence tasks in the field of natural language processing [60,61,62,63,64]. Compared to traditional recurrent neural networks (such as RNN [65] and LSTM [66]), Transformer completely abandons the recurrent structure and adopts the attention mechanism, as shown in Figure 5, enabling the model to handle the global dependencies of sequences to improve processing efficiency and effectiveness. The attention mechanism, which consists of self-attention mechanism and multi-head attention mechanism, is the core of the Transformer architecture.
The self-attention mechanism enables the Transformer to consider all elements in a sequence when processing each element, learn correlations at each position, and capture long-distance dependencies regardless of their distances. By calculating the attention each element has towards others and computing attention scores, the self-attention mechanism determines the importance of different parts of the sequence. These scores weight the input sequence representation to produce the weighted output. Consequently, the self-attention mechanism allows the model to integrate information from various positions and focus on their importance, achieving more flexible and effective modeling.
The multi-head attention mechanism extends the self-attention mechanism by partitioning the input into multiple “heads” and conducting self-attention operations simultaneously on each head. This approach enables the Transformer to capture diverse information aspects among inputs within distinct representational subspaces. Subsequently, the outputs from these heads are aggregated to form the final output. This mechanism enriches the model’s expressive capacity, empowering the Transformer to effectively manage intricate input sequences and handle a variety of sequence-to-sequence tasks, such as machine translation [67,68] and text summarization [69,70].
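The sketch below (assuming PyTorch, with illustrative dimensions) shows scaled dot-product self-attention and how multi-head attention splits the input into heads, attends within each representational subspace, and aggregates the results.
```python
# A minimal sketch (assuming PyTorch) of scaled dot-product self-attention with
# multiple heads, as described above; the dimensions are illustrative.
import math
import torch
import torch.nn as nn

def self_attention(q, k, v):
    # Attention scores measure how much each position attends to every other one.
    scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))
    weights = scores.softmax(dim=-1)   # importance of each position
    return weights @ v                 # weighted sum of the value vectors

class MultiHeadSelfAttention(nn.Module):
    def __init__(self, dim: int = 256, num_heads: int = 8):
        super().__init__()
        self.num_heads, self.head_dim = num_heads, dim // num_heads
        self.qkv = nn.Linear(dim, dim * 3)  # joint projection to queries, keys, values
        self.out = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, n, d = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # Split into heads so each head attends within its own representational subspace.
        split = lambda t: t.view(b, n, self.num_heads, self.head_dim).transpose(1, 2)
        out = self_attention(split(q), split(k), split(v))
        out = out.transpose(1, 2).reshape(b, n, d)  # aggregate the heads
        return self.out(out)

# Usage: MultiHeadSelfAttention()(torch.randn(2, 196, 256)).shape -> (2, 196, 256)
```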
The Transformer model, initially designed for NLP tasks, has been extended to computer vision realms such as image recognition [71,72,73,74,75], object detection [76,77,78,79,80], image segmentation [81,82,83,84,85], and video processing [86,87,88,89,90]. The Transformer excels in capturing global dependencies among image patches through the attention mechanism, transcending the local perception of traditional CNNs. Positional encoding ensures incorporation of spatial relationships within images, crucial for tasks like object detection and segmentation. In more complex visual tasks, the Transformer adopts an encoder–decoder structure, with the encoder capturing inter-area relationships and the decoder generating task-specific results.

3.2.1. Vision Transformer

The Vision Transformer (ViT) [71] applied Transformers to image recognition by partitioning images into patches and linearly embedding them into sequences, akin to the processing of word sequences in NLP, as shown in Figure 6.
The ViT first divides the input image into fixed-size non-overlapping patches. Each patch is flattened into a one-dimensional vector and then projected linearly to a fixed dimension space. To preserve the positional information of the patches, the ViT incorporates position embeddings. Each patch vector is combined with its corresponding positional information vector, creating a sequence of patch vectors that retain their positional context. These position-encoded patch vectors serve as the input to the Transformer Encoder. The self-attention mechanism captures relationships between patches globally, while the feed-forward network applies non-linear transformations. In the output of the last Transformer Encoder layer, a special classification token is typically added to represent the global feature of the entire image. This token is passed through a simple fully connected layer and a softmax layer for classification, predicting the image’s category.
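A minimal sketch of this input pipeline is shown below (assuming PyTorch, with the common 224 × 224 image size and 16 × 16 patches as illustrative choices): patches are projected linearly, a classification token is prepended, and position embeddings are added.
```python
# A minimal sketch (assuming PyTorch) of the ViT input pipeline described above:
# patch splitting, linear projection, class token, and position embeddings.
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    def __init__(self, img_size=224, patch_size=16, in_chans=3, dim=768):
        super().__init__()
        num_patches = (img_size // patch_size) ** 2
        # A strided convolution splits the image into non-overlapping patches and
        # projects each flattened patch to a fixed-dimensional vector in one step.
        self.proj = nn.Conv2d(in_chans, dim, kernel_size=patch_size, stride=patch_size)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))               # global classification token
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, dim)) # position embeddings

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.proj(x).flatten(2).transpose(1, 2)          # (B, num_patches, dim)
        cls = self.cls_token.expand(x.size(0), -1, -1)
        return torch.cat([cls, x], dim=1) + self.pos_embed   # sequence fed to the encoder

# Usage: PatchEmbedding()(torch.randn(1, 3, 224, 224)).shape -> (1, 197, 768)
```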

3.2.2. Detection Transformer

The Detection Transformer (DETR) [76] represents the pioneering application of the transformer architecture to object detection tasks. It combines the transformer architecture with CNN to propose an innovative method for object detection. By encoding image features into sequences and introducing object queries, DETR can model the relationships between image features and objects on a global scale.
DETR’s overall architecture consists of a CNN feature extractor, a Transformer Encoder–Decoder module, and a Feed-Forward Network (FFN), as shown in Figure 7. Initially, input images are processed through a CNN such as ResNet, extracting low-resolution feature maps that encapsulate crucial image details. These feature maps are flattened into a sequence and augmented with positional embeddings to preserve spatial information. The sequence is then fed into a Transformer Encoder, which uses Multi-Head Self-Attention and FFN to capture global relationships among features. A set of fixed, learnable object queries is introduced to represent potential targets. These queries, along with the Encoder’s output, are fed into a Transformer Decoder, which further processes them using Self-Attention and Cross-Attention mechanisms. The Decoder outputs feature representations for each object query, which are then passed through the FFN to predict bounding boxes and class labels. Hungarian Matching is used to align the predictions with ground truth annotations, and the model is optimized using a loss function that includes classification and bounding box regression losses, enabling accurate target detection and localization.
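The heavily simplified sketch below (assuming PyTorch) illustrates this flow: backbone features are flattened into a sequence, combined with positional information, processed by a Transformer encoder–decoder driven by learnable object queries, and mapped by FFN heads to class logits and boxes. The placeholder backbone, feature-map size, and query count are assumptions for illustration, not DETR’s actual configuration.
```python
# A highly simplified sketch (assuming PyTorch) of a DETR-style pipeline: CNN features,
# transformer encoder-decoder with learnable object queries, and FFN prediction heads.
import torch
import torch.nn as nn

class MiniDETR(nn.Module):
    def __init__(self, num_classes=24, num_queries=100, dim=256):
        super().__init__()
        self.backbone = nn.Sequential(  # stand-in for a ResNet backbone
            nn.Conv2d(3, dim, kernel_size=7, stride=4, padding=3), nn.ReLU(),
            nn.Conv2d(dim, dim, kernel_size=3, stride=4, padding=1), nn.ReLU(),
        )
        self.pos_embed = nn.Parameter(torch.zeros(1, 14 * 14, dim))    # assumes 224x224 input
        self.queries = nn.Parameter(torch.zeros(1, num_queries, dim))  # learnable object queries
        self.transformer = nn.Transformer(d_model=dim, batch_first=True)
        self.class_head = nn.Linear(dim, num_classes + 1)  # +1 for the "no object" class
        self.box_head = nn.Linear(dim, 4)                  # normalized (cx, cy, w, h)

    def forward(self, images: torch.Tensor):
        feats = self.backbone(images).flatten(2).transpose(1, 2)  # feature map -> sequence
        src = feats + self.pos_embed                              # add positional information
        hs = self.transformer(src, self.queries.expand(images.size(0), -1, -1))
        return self.class_head(hs), self.box_head(hs).sigmoid()   # per-query labels and boxes

# Usage: logits, boxes = MiniDETR()(torch.randn(1, 3, 224, 224))
# logits: (1, 100, 25), boxes: (1, 100, 4); Hungarian matching pairs queries with ground truth.
```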

3.3. Comparison of CNN and Transformer

The CNN uses convolutional layers to process input images. Convolution operations capture local spatial relationships, making them effective for object detection, image recognition and other computer vision tasks. The Transformer uses self-attention mechanisms to process input data. It treats images as sequences of patches, allowing it to capture long-range dependencies more effectively than the CNN. The CNN and the Transformer, due to their different principles and architectures, have their own strengths and weaknesses, as shown in Table 1.
In object detection tasks, CNNs excel at real-time detection and small-object detection, while in image recognition they are well suited to standard recognition tasks. Transformers excel at capturing global context, handling complex backgrounds, and processing multimodal data in object detection tasks; in image recognition, they excel at large-scale recognition tasks.

3.4. Evaluation Metrics

In deep learning, evaluating the performance of an algorithm is crucial. A series of evaluation metrics are chosen to quantify model performance and to enable a fair comparison between different deep learning algorithms.

3.4.1. Evaluation Metrics for Recognition

  • Accuracy
    $\mathrm{Accuracy} = \dfrac{TP + TN}{TP + TN + FP + FN}$
  • Precision
    $\mathrm{Precision} = \dfrac{TP}{TP + FP}$
  • Recall
    $\mathrm{Recall} = \dfrac{TP}{TP + FN}$
  • F1-Score
    $\mathrm{F1\text{-}Score} = \dfrac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$
In the four formulae above, TP represents the number of samples correctly predicted as positive by the model; TN represents the number of samples correctly predicted as negative by the model; FP represents the number of negative samples incorrectly predicted as positive by the model; FN represents the number of positive samples incorrectly predicted as negative by the model.
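The short sketch below computes the four recognition metrics directly from the TP, TN, FP, and FN counts defined above (Python; the counts in the usage example are illustrative).
```python
# A minimal sketch computing the four recognition metrics above from TP, TN, FP, FN counts.
def classification_metrics(tp: int, tn: int, fp: int, fn: int) -> dict:
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Usage: classification_metrics(tp=80, tn=90, fp=10, fn=20)
# -> {'accuracy': 0.85, 'precision': ~0.889, 'recall': 0.8, 'f1': ~0.842}
```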

3.4.2. Evaluation Metrics for Detection

In addition to the four metrics mentioned above, there are two further evaluation metrics commonly used in object detection tasks: Average Precision (AP) and mean Average Precision (mAP).
  • Average Precision
    $AP = \int_{0}^{1} \mathrm{Precision}(\mathrm{Recall}) \, d(\mathrm{Recall})$
  • mean Average Precision
    $mAP = \dfrac{1}{|C|} \sum_{c \in C} AP_c$
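The sketch below (assuming NumPy, and the common monotone precision-envelope convention) approximates AP as the area under the precision–recall curve and averages per-class APs into mAP; the inputs are assumed to be precision and recall values accumulated over score-sorted detections.
```python
# A minimal sketch of AP as the area under the precision-recall curve, approximated
# numerically from sorted detections; mAP averages AP over all classes in C.
import numpy as np

def average_precision(recalls: np.ndarray, precisions: np.ndarray) -> float:
    # Append sentinel points and enforce a monotonically decreasing precision envelope,
    # a common convention before integrating the curve.
    r = np.concatenate(([0.0], recalls, [1.0]))
    p = np.concatenate(([0.0], precisions, [0.0]))
    p = np.maximum.accumulate(p[::-1])[::-1]
    return float(np.sum((r[1:] - r[:-1]) * p[1:]))  # AP ~= integral of Precision d(Recall)

def mean_average_precision(ap_per_class: dict) -> float:
    return sum(ap_per_class.values()) / len(ap_per_class)  # mAP = (1/|C|) * sum of AP_c

# Usage (illustrative values):
# ap = average_precision(np.array([0.2, 0.4, 0.6]), np.array([1.0, 0.8, 0.6]))
# mean_average_precision({"aphid": ap, "borer": 0.5})
```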

4. Datasets

Due to the complexity of collecting pest images, many researchers, aside from those specifically investigating pests on certain crops, use public datasets such as the IP102 dataset [91] and the D0 dataset [92], which are used for both pest detection and pest recognition, and the Pest24 dataset [93] for pest detection. The D0 dataset is a pest dataset containing approximately 4500 images of pests across 40 species. Compared to the D0 dataset, the IP102 dataset and Pest24 dataset present more challenging tasks for pest detection and recognition, which will be introduced in this section.

4.1. IP102 Dataset

IP102 is a large-scale pest dataset designed for research in pest detection and recognition. The IP102 dataset consists of 75,222 pest images from 102 different categories commonly found in agricultural environments, as shown in Figure 1. It is split into training, validation, and test sets in a 6:1:3 ratio.
IP102 is the most widely used large-scale pest dataset to date, which includes a rich variety of pest categories and a sufficient number of samples. The extensive samples in the IP102 dataset can help researchers solve many complex problems in practical applications. Some of these problems are listed below:
  • The appearance of the same pest varies greatly across different life stages, as shown in Figure 8.
  • Different species of pests can have nearly indistinguishable appearances under certain specific behaviors, as shown in Figure 9.
  • Pest categories sometimes have to be recognized from the traces pests leave behind, as shown in Figure 10.

4.2. Pest24 Dataset

The Pest24 dataset contains 25,378 JPG images of 24 categories of field crop pests specified for monitoring by the Ministry of Agriculture of China, all with a resolution of 2095 × 1944 pixels. The dataset is divided into training, validation, and test sets in a 5:2:3 ratio.
The Pest24 dataset has the following two significant features that present new challenges for deep learning-based pest detection:
1.
Pest sizes are extremely small, as shown in Figure 11.
2.
Some images exhibit pests that are densely distributed and adhered to each other, as shown in Figure 12.
Figure 11. This figure shows extremely small pests in the Pest24 dataset. (a) shows the small pests: Stem borer, Little gecko, and Plutella xylostella. (b) shows the small pest: Bollworm. (c) shows the small pests: Spodoptera litura, Stem borer, Plutella xylostella, and Nematode trench. (d) shows the small pests: Nematode trench.
Figure 12. This figure shows images that contain densely distributed and adhered pests. (a) shows the densely distributed and adhered Rice leaf rollers, Striped rice borers, Bollworms, etc. (b) shows the densely distributed and adhered Bollworms, Stem borers, Agriotes fuscicollis Miwas, etc. (c) shows the densely distributed and adhered Athetis lepigones, Stem borers, Agriotes fuscicollis Miwas, etc. (d) shows the densely distributed and adhered Nematode trenches, Agriotes fuscicollis Miwas, Melahotuses, etc.

5. Review of Algorithms

This section delves into the latest advancements in the fields of pest detection and pest recognition. Specifically, it provides a detailed description of algorithms proposed in recent years and compares the metrics and performance of these algorithms.

5.1. Pest Detection Algorithms

Table 2 presents a comparative analysis of the performance of 15 recent pest detection algorithms across different datasets. The results clearly demonstrate that the algorithm based on YOLO consistently outperforms others. Specifically, on the IP102 dataset, the YOLOv7-based algorithm, proposed in 2023, achieves an mAP of 76.3% and an F1-score of 75.2%, showcasing its efficacy in feature extraction and pest detection.
For the Pest24 dataset, while the hybrid model of ResNet-50, YOLOv3, and Transformer demonstrates improvements in mAP and Recall, the algorithm based on GhostNet and improved YOLOv5, combined with the channel attention mechanism, achieves the highest performance with an mAP of 74.1%. This underscores the effectiveness of lightweight networks and attention mechanisms in pest detection tasks.
It is also worth noting the success of ensemble learning methods on the D0 dataset, which achieved an mAP of 99.8% and an F1 score of 99.7%, emphasizing the potential of model fusion in enhancing detection performance. Additionally, the performance of some algorithms on self-collected datasets is noteworthy. For instance, a hybrid model of improved YOLOv8 and Transformer achieved an mAP of 98.17% and an F1-score of 97.13% on the dataset consisting of 2864 images, demonstrating adaptability and generalization capabilities.
The experimental results presented in the table underscore the strengths of various algorithms on different datasets and their potential applications in pest detection tasks. The following are detailed descriptions of each algorithm.

5.1.1. Algorithms on the IP102 Dataset

Ref. [94] proposed an ensemble method of CNNs based on different topologies (EfficientNetB0, ResNet-50, GoogleNet, ShuffleNet, MobileNetv2, and DenseNet201) and introduced two novel DGrad-based variants of the Adam optimization algorithm (Exp and ExpLR) to optimize the training process of the proposed model.
Ref. [95] proposed an approach for pest detection by deeply mining information within the feature maps of convolutional neural networks. This approach utilizes the class activation mapping technique to generate activation maps of target classes during the forward propagation process of the model, and improves the accuracy of detection tasks by designing a localization loss function that guides the model to focus on hotspot areas in the activation map, thereby capturing key regions of the target.
Ref. [96] utilized the CSPResNeXt-50 module and the VoVGSCSP module to replace the original ELAN module and ELAN-W module in YOLOv7, respectively, which simplifies the model structure, reduces the number of parameters and computational load, and improves detection accuracy of maize pests.
The comparison of the mAP metrics for the models proposed in [94,95,96] is shown in Figure 13a. The respective strengths and weaknesses of these models are listed in Table 3.

5.1.2. Algorithms on the Pest24 Dataset

Ref. [97] integrated the Squeeze-and-Excitation attention mechanism module into the CNN for image data mining, key feature extraction, and suppression of irrelevant features. Additionally, a Cross-Stage Multi-Feature Fusion method was designed to improve the structure of the feature pyramid network and path aggregation network, enhancing the feature representation of small objects.
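For reference, a generic Squeeze-and-Excitation block is sketched below (assuming PyTorch); it illustrates the channel-attention mechanism in general and is not the specific implementation of [97].
```python
# A generic Squeeze-and-Excitation (SE) block sketch (assuming PyTorch); it illustrates
# the mechanism in general, not the specific implementation of [97].
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)  # squeeze: global spatial statistics per channel
        self.fc = nn.Sequential(             # excitation: per-channel importance weights
            nn.Linear(channels, channels // reduction), nn.ReLU(),
            nn.Linear(channels // reduction, channels), nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = x.shape
        w = self.fc(self.pool(x).view(b, c)).view(b, c, 1, 1)
        return x * w  # reweight channels, emphasizing key features and suppressing irrelevant ones

# Usage: SEBlock(64)(torch.randn(1, 64, 32, 32)).shape -> (1, 64, 32, 32)
```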
Ref. [98] incorporated the Efficient Channel Attention mechanism to enhance feature extraction capability and avoid the downsampling issue present in the traditional SE attention mechanism, thereby capturing cross-channel interactions more effectively. It also integrated the transformer encoder into the CNN architecture, improving the model’s ability to capture global contextual information. Additionally, it utilized the Cross-Stage Feature Fusion (CSFF) method, significantly enhancing the feature representation of small targets, such as agricultural pests, during the feature fusion stage.
Ref. [99] employed the GhostNet as the backbone, enhancing its feature extraction capabilities by integrating the Efficient Channel Attention mechanism. High-resolution feature maps were introduced in the Bidirectional Feature Pyramid Network (BiFPN), and the addition of horizontal residual connections enhanced the model’s ability to detect small pests. This enriched the data flow paths and highlighted small target features. These improvements significantly enhanced the detection accuracy and speed of the proposed model.
The comparison of the mAP metrics for the models proposed in [97,98,99] is shown in Figure 13b. The respective strengths and weaknesses of these models are listed in Table 4.

5.1.3. Algorithms on Self-Collected Datasets

Ref. [100] introduced a context-aware attention network that preliminarily recognizes pest images into different crop categories by extracting multi-scale contextual information from the images as prior knowledge. It also proposed a Multi-Projection Pest Detection Model (MDM), which generates super-resolution features for pest detection by combining small-scale contextual information from lower convolutional layers with information from higher convolutional layers. Additionally, the model employed attention mechanisms and data augmentation techniques to further enhance the effectiveness of in-field pest detection.
Ref. [101] developed a Deformable Residual Network Module (DRB-Net), which utilizes deformable convolutions to extract multi-scale and deformable features of pests, enhancing the ability to model geometric transformations of pests. Then, a Global Context-Aware Module (GCF) was proposed, which captures global contextual features of the image through global pooling and subsequent fully connected layers, combining them with local features to improve the accuracy of pest recognition. Finally, an FPN was employed for multi-scale feature extraction, enhancing the model’s ability to recognize pests of different sizes.
Ref. [102] introduced the SWin Transformer (SWinTR) and Transformer (C3TR) mechanisms to capture more global features and expand the receptive field. It employed ResSPP to enhance the backbone network’s feature extraction capabilities. During the feature fusion stage, it transformed the C3 output neck into SWinTR to extract and convey global features. The paper also introduced WConcat to enhance feature fusion capabilities, allowing different feature maps to have different weights during fusion.
Ref. [103] employed Deformable Convolutions (DCNv3) in the feature extraction network to accommodate the diversity and complexity of pest images. It integrated the biformer dynamic attention mechanism into the feature fusion network, enhancing the model’s ability to capture feature information. A new implicit decoupled head was designed at the output end to optimize the accuracy of the prediction results. Additionally, the soft-NMS algorithm was applied in the prediction module to effectively address issues of multiple detections and missed detections.
Ref. [104] introduced DenseNet blocks and an Adaptive Attention Module (AAM) in the image feature extraction part, significantly improving feature map utilization, reducing information loss, and enhancing the model’s ability to efficiently use feature representations. Moreover, by designing a feature fusion network, the model effectively integrated the feature extraction and feature aggregation paths, allowing deep networks to utilize spatial location information from shallow networks, further improving detection accuracy. Finally, the model implemented a multi-scale prediction module, enabling the prediction of bounding boxes and classes on feature maps of different resolutions, which effectively enhances the detection of small objects and improves overall detection accuracy.
Ref. [105] introduced the ECA mechanism, which enhanced the network’s ability to capture detailed feature information of small targets, enlarged the local perception field, and integrated multi-scale features, thereby improving the model’s detection accuracy of small targets. It adopted the Bidirectional Feature Pyramid Network (BiFPN), which strengthened the fusion of high and low-level feature information through hierarchical connections, enhancing the model’s ability to detect targets of different scales. The paper also improved Non-Maximum Suppression (NMS) with the Distance-IoU-based NMS (DIOU_NMS) algorithm, which considered distance, overlap, and scale effects between predicted and target boxes. This enhancement effectively reduced missed detections in dense target scenarios, further improving detection recall and accuracy.
Ref. [106] introduced a new Point-Line Distance IoU Loss (PLDIoU Loss), which simplified distance calculations between candidate and true bounding boxes, reducing redundant computations, speeding up localization, and improving accuracy. To enhance the model’s feature extraction capability for target objects, the paper integrated a Convolutional Block Attention Module (CBAM) into the network, which adaptively focused on target objects in both channel and spatial dimensions, thereby improving detection and recognition accuracy. Additionally, it adopted the mixup online data augmentation algorithm, which extended the training dataset by blending different images, to enhance model generalization and robustness while preventing overfitting. These improvements significantly enhanced pest detection performance while keeping the model lightweight.
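For reference, a generic mixup augmentation step is sketched below (assuming PyTorch); it illustrates the idea of blending image pairs and their labels, not the exact online augmentation setup of [106].
```python
# A generic mixup augmentation sketch (assuming PyTorch); it blends pairs of images and
# their labels within a batch, illustrating the idea rather than the exact setup of [106].
import torch

def mixup(images: torch.Tensor, labels: torch.Tensor, alpha: float = 0.2):
    lam = torch.distributions.Beta(alpha, alpha).sample().item()  # mixing coefficient
    perm = torch.randperm(images.size(0))                         # random pairing within the batch
    mixed_images = lam * images + (1 - lam) * images[perm]
    # Training loss is typically lam * loss(pred, labels) + (1 - lam) * loss(pred, labels[perm]).
    return mixed_images, labels, labels[perm], lam

# Usage: x, y_a, y_b, lam = mixup(torch.randn(8, 3, 224, 224), torch.randint(0, 24, (8,)))
```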
Ref. [107] proposed MAM-IncNet, a novel network architecture that substitutes the convolutional layers in the front-end of the SSD detector with optimized Inception modules (M-Inception). It leveraged a pre-trained VGG16 backbone to enhance feature extraction capabilities. The paper proposed a hybrid attention mechanism integrating both channel attention and spatial attention to enhance the model’s focus on pertinent information while filtering out irrelevant noise. This approach facilitated better utilization of inter-channel relationship features and improved the learning of spatial feature importance. Furthermore, the study employed transfer learning by initializing network parameters with weights pre-trained on extensive image datasets, addressing data scarcity challenges and enhancing learning efficiency.
Ref. [108] introduced the Sliced Aided Fine-tuning and Hyper Inference (SAHI) method, which enhanced the model’s performance on low-resolution images and small object detection by slicing the input image. By dividing the input image into multiple slices, this approach provided a larger pixel area for small objects, thereby improving the network’s inference and fine-tuning effectiveness while offering more detailed features for subsequent models. The paper also proposed the Generalized Efficient Layer Aggregation Network (GELAN) to replace the C2f module in the original YOLOv8 model, simplifying the network structure and enhancing feature extraction capabilities, resulting in a lightweight model. Furthermore, a BiFormer attention mechanism based on the Transformer architecture was introduced to enhance the features of tea pest targets. Additionally, the YOLOv8 neck network was integrated with an MS structure for feature fusion, improving the extraction of both fine-grained and coarse-grained semantic information.
The comparison of the mAP metrics for the models proposed in [100,101,102,103,104,105,106,107,108] is shown in Figure 14. The respective strengths and weaknesses of these models are listed in Table 5.

5.2. Pest Recognition Algorithms

Table 6 offers a comprehensive comparison of the performance metrics of 15 pest recognition algorithms from recent years on several datasets.
On the IP102 dataset, the ResNet model combined with feature fusion, proposed in 2024, achieved an accuracy of 68.34% and an F1-Score of 68.34%. Additionally, the ensemble learning model proposed in 2020 demonstrated robust performance with an accuracy of 67.13% and an F1-Score of 65.76%. Notably, the use of multi-image fusion in conjunction with ResNet-50, proposed in 2023, resulted in the highest accuracy of 96.1% and the highest F1-Score of 95.9% on the same dataset, indicating the potential of ResNet models and multi-image fusion technologies for pest recognition tasks.
Experimental results on the D0 dataset showed extremely high accuracy rates, with the ensemble learning model proposed in 2020 achieving an accuracy and an F1-Score of 98.81%. Furthermore, the ResNet-50 model with multi-image fusion excelled with an accuracy of 100% and an F1-Score of 100%, underscoring the effectiveness of feature engineering in enhancing model performance.
On different self-collected datasets, the algorithms also exhibited commendable results. For instance, the model based on GoogleNet, proposed in 2020, achieved an accuracy of 96.67%, while the model based on MobileNetv2, proposed in 2021, reached an accuracy of 99.14%. Additionally, the model based on ResNet-50 combined with the self-attention mechanism, proposed in 2024, achieved an accuracy of 99.80%. These results highlight the capability of these models to generalize well to new data and showcase the diversity of approaches and their respective outcomes. Furthermore, the use of attention mechanisms, such as parallel attention and coordinate attention, has also been noted to contribute to the accuracy of pest recognition models.
The experimental results shown in the table highlight the advantages of various algorithms on different datasets and their potential applications in pest recognition tasks. Below are detailed descriptions of each algorithm.

5.2.1. Algorithms on the IP102 Dataset

Ref. [109] proposed a fused residual block for pest recognition. Building upon the original residual block, this fused residual block is capable of integrating features from the previous layer between two 1 × 1 convolutional layers, which improves the capacity of the residual block. Additionally, a deep feature fusion residual network (DFF-ResNet) was constructed by stacking fused residual blocks.
Ref. [110] proposed a weighted ensemble method based on a genetic algorithm (GAEnsemble), which determined the model weights by considering success rate and prediction stability. This ensemble method integrated seven different pre-trained CNN models (VGG-16, VGG-19, ResNet-50, Inception-V3, Xception, MobileNet, and SqueezeNet) and improved the accuracy and robustness of pest recognition.
Ref. [111] proposed a multi-scale attention learning network (MS-ALN) for pest recognition, which recursively located discriminative regions and learned region-based feature representations through four branches. MS-ALN consisted of three modules: the Target Localization Module (TLM), the Attention Detection Module (ADM), and the Attention Removal Module (ARM). The TLM filtered the background and located the target areas; the ADM detected high-response regions and guided the network to learn finer-grained features; the ARM randomly removed distinguishing regions to enhance the model’s robustness to occlusion.
Ref. [112] utilized the EfficientNetV2 as the backbone network and introduced a Coordinate Attention mechanism to learn pest information and location details from the input images. Additionally, a feature fusion module was developed by combining feature maps from the Mobile Inverted Bottleneck outputs and the average pooling outputs, achieving the integration of shallow and deep features and addressing the issue of pest feature loss during downsampling.
Ref. [113] combined the EfficientNetV2 network with transfer learning and progressive learning for pest recognition. The proposed model adopted a progressive learning mechanism, gradually expanding the network topology during the training process, which enhances the model’s learning capability and interpretability, overcoming the limitations of transfer learning.
Ref. [114] proposed an ensemble-based transfer learning model. This ensemble model incorporated pre-trained models such as VGG-16, VGG-19, and ResNet-50, each equipped with a voting classifier ensemble technique. These pre-trained models were applied in parallel pipeline models on the training dataset and combined with the voting classifier to generate the final predictions for the samples.
Ref. [115] proposed a multi-image fusion recognition approach based on ResNet-50 for recognizing the same species of pests across multiple images. This approach utilized an Efficient Feature Localization Module to aggregate the feature map outputs from all blocks in the final stages of the CNN. It recognized regions with high activation values as pest locations and cropped these features to obtain localization features. The Adaptive Filtering Fusion Module was used to learn gating masks and selection masks to eliminate the interference of useless information. Additionally, this model employed attention mechanisms to select and fuse beneficial features.
Ref. [116] proposed the Visual Regeneration Fusion Network for pest recognition. This model integrated multi-scale features from the Global Feature Extraction network and the Visual Regeneration network, capturing semantic details for pest recognition and enhancing accuracy. Additionally, this paper proposed a patch-based augmentation approach, which could effectively simulate environments where insects were partially obscured, thereby improving robustness.
The comparison of the Accuracy metrics for the models proposed in [109,110,111,112,113,114,115,116] is shown in Figure 15. The respective strengths and weaknesses of these models are listed in Table 7.

5.2.2. Algorithms on Self-Collected Datasets

Ref. [117] compared five common CNN networks (VGG-16, VGG-19, ResNet-50, ResNet-152, and GoogLeNet), and proposed a network suitable for pest recognition in natural environments, by fine-tuning the GoogLeNet model. This model demonstrated remarkable performance in the task of recognizing the following pests: Cydia pomonella, Gryllotalpa, Leafhopper, Locust, Oriental fruit fly, Pieris rapae Linnaeus, Snail, Spodoptera litura, Stinkbug, and Weevil.
Ref. [118] integrated spatial and channel attention mechanism and classification activation map into the MobileNetv2 to learn significant pest information from the input images. Additionally, an optimized loss function and a two-stage transfer learning method were adopted during model training. This progressive learning approach first enabled the model to recognize the large-scale structures in the images, and then gradually shifted its focus to finer details, thereby improving the accuracy of pest recognition. This model demonstrated superb performance in the task of recognizing the following pests: Cydia pomonella, Gryllotalpa, Leafhopper, Locust, Oriental fruit fly, Pieris rapae Linnaeus, Snail, Spodoptera litura, Stinkbug, and Weevil.
Ref. [119] constructed a Multi-Scale Convolution-Capsule Network (MSCCN) composed of multi-scale convolution modules, capsule network modules, and softmax classification modules. The MSCCN integrated the advantages of CNN, CapsNet, and multi-scale CNN to learn robust features from pest images at different scales, thereby enabling pest recognition. This model demonstrated outstanding performance in the task of recognizing the following pests: Rice leaf roller, Rice leaf caterpillar, Asiatic rice borer, Yellow rice borer, Rice gall midge, Rice Stemfly, Rice water weevil, Rice leafhopper, and Rice shell pest.
Ref. [120] proposed a Parallel Spatial and Channel Attention mechanism (PSCA) that combines spatial and channel attention, and integrated it into ResNet-50 to construct the ResNet-50-PSCA network. This model demonstrated remarkable performance in the task of recognizing the following pests: Aphidoidea, Cabbage butterfly, Drosophila, Gryllotalpa, Leafhopper, Locust, Snail, Stinkbug, Weevil, and Whitefly.
Ref. [121] proposed the ITF-WPI cross-model feature fusion model, which included two components: CoTN and ODLS, used for parallel processing of images and text, respectively. CoTN utilized the Transformer structure (CoT) and Pyramid Squeezed Attention (PSA) mechanism, focusing on the extraction of contextual features, and enhanced the extraction of multi-scale feature structure information through PSA. The ODLS network, which combined 1D convolution and bidirectional LSTM stacking, possessed a more robust capability for text feature acquisition compared to other advanced CNN-LSTM models. This model demonstrated remarkable performance in the task of recognizing the following wolfberry pests: Geometridae, Cicadella viridis, Crioceridae, Elthemidea sp, Membracidae, Mylabris speciosa Pallas, Tropidothorax elegans distant, Cerambycidae, Nephrotoma sp, Thripidae, Epitri abeillei, Bedbug, Tephritidae, Agrotis ypsilon, Adelgoidea, Plodia interpunctella, and Carposinidae.
Ref. [122] proposed a multimodal transformer model (MMFGT) that combined self-supervised learning and fine-grained recognition. The model introduced contrastive learning to extract target features and utilized a part selection module (PSM) to focus attention on key areas of the image, thereby improving the recognition accuracy of small pest targets. Additionally, the model integrated multimodal information from images and natural language descriptions, enhancing feature representation through the ALBERT text encoder. These comprehensive improvements significantly enhance the performance and accuracy of pest recognition. This model demonstrated superb performance in the task of recognizing the following pests: Colposcelis signata, Piezodorus rubrofasciatus, Riptortus pedestris, Eysacoris guttiger, Erthesina fullo, Membracidae, Acrida cinerea, Tingidae, Oxya, Scurelleridae, Spoladea recurvalis, Cletus schmidti Kiritshenko, Ascotis selenaria Schiffermuller et Denis, Helicoverpa armigera, Berytidae, Taiwania, Aphidoidea, Eurygaster testudinarius, Spodoptera frugiperda, Trigonotylus ruficornis Geoffroy, Riptortus linearis Fabricius, Rhopalosiphum maidis, Pygmy sand cricket, Atractomorpha sinensis Bolivar, Tropidothorax elegans Distant, Cletus punctiger Dallas, Dolycoris baccarum, Nysius ericae, and Longhorned grasshoppers.
Ref. [123] combined the self-attention mechanism with the ResNet architecture to propose a novel pest recognition method. By introducing two parallel self-attention branches into ResNet, the model was able to more accurately extract and enhance key features in the images, thereby significantly improving the accuracy of pest recognition. Additionally, through data augmentation techniques and carefully tuned hyperparameters, this method further enhanced the model’s generalization ability and performance, achieving efficient recognition of various pests in complex agricultural settings. This model demonstrated extraordinary performance in the task of recognizing the following pests: Aphidoidea, Spodoptera frugiperda, Coleoptera, Pectinophora gossypiella, Gomphocerinae, Acariformes, Culicidae, Symphyta, and Scirpophaga incertulas.
The comparison of the Accuracy metrics for the models proposed in [117,118,119,120,121,122,123] is shown in Figure 16. The respective strengths and weaknesses of these models are listed in Table 8.

6. Challenges and Future Research Directions

Pest detection and recognition algorithms are crucial components of smart agriculture, playing a vital role in ensuring the healthy growth of crops, maintaining ecological balance, and enhancing food production. Despite significant progress in recent years, there are still several challenges and future research directions to explore. Based on the performance analysis of deep learning-based pest detection and recognition algorithms on various datasets presented in this paper, this section summarizes the challenges encountered by these algorithms and proposes directions for future research in both realms.

6.1. Challenges

6.1.1. Complex Agricultural Environments

In real agricultural environments, pest detection and recognition algorithms need to handle complex factors such as varying lighting conditions, background noise, and crop diversity. These factors pose challenges to the accuracy and robustness of algorithms. The experimental results of different algorithms on the IP102 dataset, the D0 dataset, and self-collected datasets, as shown in Table 2 and Table 6, demonstrated that although deep learning models excel in controlled environments, their effectiveness diminishes in field conditions.

6.1.2. Variability in Pest Appearance

The morphological changes in pests during their growth, such as variations in body length, color, and texture, significantly increase the complexity of detection and recognition algorithms. These algorithms must adapt to these changes to accurately detect and recognize pests at different lifecycle stages, including larvae, pupae, and adults. The experimental results of different algorithms on the IP102 dataset and self-collected datasets, as shown in Table 2 and Table 6, demonstrate that the performance of pest detection and recognition algorithms declines when encountering pests with morphological changes.

6.1.3. Small and Densely Distributed Pests

The detection of small pests, which are often densely distributed, poses significant challenges in high-resolution images due to their small size, an issue not fully addressed by existing algorithms, leading to limited detection accuracy. The experimental results of different algorithms on the Pest24 dataset and self-collected datasets, as shown in Table 2, demonstrated that pest detection algorithms have difficulties and challenges in detecting densely distributed small pests.

6.2. Future Research Directions

6.2.1. Directions in Pest Detection

  • Optimization of deep learning models: The challenges posed by complex agricultural environments and variability in pest appearance in pest detection tasks can be addressed by continuously optimizing deep learning models and enhancing their feature extraction capabilities. Ref. [96] proposed an improved YOLOv7 model where the original ELAN module and ELAN-W module were replaced by the CSPResNeXt-50 module and the VoVGSCSP module, respectively. Compared to Faster R-CNN, YOLOv3, YOLOv5-X, YOLOv7, and YOLOv7-X, its mAP is higher by 21.3%, 4.6%, 10.8%, 5.5%, and 4.8%, respectively, demonstrating the significant results and importance of optimizing deep learning models.
  • Integration of attention mechanisms: The problem of small and densely distributed pests in pest detection tasks can be addressed by using attention mechanisms. Experimental results from [97,98,99] demonstrate that the model from [99], which integrates a channel attention mechanism, performs better than the models from [97,98]. The experimental results from [103,104,105,106,107] also demonstrate the excellent performance achieved by models incorporating attention mechanisms. By employing attention mechanisms, a model can focus on key areas within the image, such as specific parts of the pests, thereby improving the detection accuracy of small targets.
  • Hybrid architecture network: A hybrid architecture network typically combines the strengths of various network structures and techniques to address limitations that a single network may encounter. Ref. [108] proposed a hybrid architecture combining CNN and Transformer. Compared to Faster R-CNN, SSD, YOLOv5, YOLOv7, and YOLOv8, its average accuracy is higher by 17.04%, 11.23%, 5.78%, 3.75%, and 2.71%, respectively, demonstrating that the hybrid model integrates the fast inference speed of YOLO with the high accuracy of the Transformer.

6.2.2. Directions in Pest Recognition

  • Multi-image fusion: The challenges posed by complex agricultural environments and variability in pest appearance in pest recognition tasks can be mitigated by adopting a multi-image fusion approach. Ref. [115] proposed a multi-image fusion recognition approach based on ResNet-50. Experimental results demonstrated that this model achieved an accuracy of 96.1% and 100%, with the fusion of five and two images on the IP102 dataset and D0 dataset, respectively, illustrating the great potential of multi-image fusion methods.
  • Multimodal feature fusion: Experimental results from [121,122] demonstrate that the Transformer model, utilizing multimodal feature fusion techniques, achieves outstanding recognition accuracy on the self-collected image-text multimodal dataset. Therefore, in addition to visual data, integrating other types of data such as text and audio data into pest detection and recognition models should also be considered. Multimodal datasets can provide comprehensive pest feature information, thereby enhancing the algorithm’s generalization ability and accuracy.
  • Applying transfer learning with pre-trained models: Deep learning models typically require substantial annotated data for training; transfer learning presents an effective solution. Leveraging pre-trained models on extensive datasets allows rapid adaptation to novel pest detection tasks, even under data constraints. Experimental results from [116], as shown in Table 6, demonstrate that utilizing an ensemble-trained CNN network through transfer learning achieves excellent results on the self-collected dataset.

7. Conclusions

With the rapid development of artificial intelligence technologies, deep learning-based pest detection and recognition algorithms have gradually replaced traditional manual feature extraction methods. These algorithms significantly enhance detection and recognition efficiency and accuracy by learning complex feature representations from large amounts of data. This paper compares the evaluation metrics of various pest detection and recognition algorithms proposed in the past five years across different datasets, summarizing the current state of research and the challenges faced by these algorithms, and proposing future research directions. Currently, some deep learning models, such as ResNet and the YOLO series, have demonstrated outstanding performance across multiple datasets, proving their effectiveness in feature extraction and pest detection and recognition. Additionally, the incorporation of ensemble learning methods and attention mechanisms has further improved the precision and robustness of these models. Despite significant progress, pest detection and recognition algorithms still face challenges, including background noise in complex agricultural environments, the diversity in pest appearance, and the detection of densely distributed small targets. To address these challenges, this paper suggests future research directions, including the construction of multimodal pest datasets, the use of transfer learning, and the development of hybrid architecture networks. Through a comprehensive evaluation of pest detection and recognition algorithms in recent years, this paper aims to provide a knowledge framework for researchers and practitioners in this field, stimulating further innovative thinking and exploration of applications, thereby contributing to global food security and the sustainable development of agriculture.

Author Contributions

Investigation, B.G. and M.G.; writing—original draft preparation, B.G.; writing—review and editing, M.C., Y.C., M.G. and B.G.; supervision, J.W. and Y.M.; project administration, J.W. and Y.M.; funding acquisition, J.W. All authors have read and agreed to the published version of the manuscript.

Funding

This research received funding from the Innovation 2030 Major S&T Projects of China (2021ZD0113601), the Key R&D Project in Shaanxi Province (2023-ZDLNY-65), and the Central Guidance on Local Science and Technology Development Fund (2022ZY1-CGZY-01HZ01).

Data Availability Statement

IP102 Dataset is available at https://github.com/xpwu95/IP102, accessed on 27 October 2023; Pest24 Dataset is available at http://aisys.iim.ac.cn/zhibao.html, accessed on 27 October 2023.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Sajitha, P.; Andrushia, A.D.; Anand, N.; Naser, M. A Review on Machine Learning and Deep Learning Image-based Plant Disease Classification for Industrial Farming Systems. J. Ind. Inf. Integr. 2024, 38, 100572. [Google Scholar] [CrossRef]
  2. Ebrahimi, M.; Khoshtaghaza, M.H.; Minaei, S.; Jamshidi, B. Vision-based pest detection based on SVM classification method. Comput. Electron. Agric. 2017, 137, 52–58. [Google Scholar] [CrossRef]
  3. Rajan, P.; Radhakrishnan, B.; Suresh, L.P. Detection and classification of pests from crop images using support vector machine. In Proceedings of the 2016 International Conference on Emerging Technological Trends (ICETT), Kollam, India, 21–22 October 2016; pp. 1–6. [Google Scholar]
  4. Sethy, P.K.; Bhoi, C.; Barpanda, N.K.; Panda, S.; Behera, S.K.; Rath, A.K. Pest Detection and Recognition in Rice Crop Using SVM in Approach of Bag-Of-Words. In Proceedings of the International Conference on Software and System Processes, Paris, France, 5–7 July 2017. [Google Scholar]
  5. Ashok, P.; Jayachandran, J.; Gomathi, S.S.; Jayaprakasan, M. Pest detection and identification by applying color histogram and contour detection by SVM model. Int. J. Eng. Adv. Technol. 2019, 8, 463–467. [Google Scholar]
  6. Kasinathan, T.; Uyyala, S.R. Machine learning ensemble with image processing for pest identification and classification in field crops. Neural Comput. Appl. 2021, 33, 7491–7504. [Google Scholar] [CrossRef]
  7. Kasinathan, T.; Singaraju, D.; Uyyala, S.R. Insect classification and detection in field crops using modern machine learning techniques. Inf. Process. Agric. 2021, 8, 446–457. [Google Scholar] [CrossRef]
  8. Pattnaik, G.; Parvathy, K. Machine learning-based approaches for tomato pest classification. TELKOMNIKA Telecommun. Comput. Electron. Control 2022, 20, 321–328. [Google Scholar] [CrossRef]
  9. Kakulapati, V.; Saiteja, S.; Raviteja, S.; Reddy, K.R. A Novel Approach Of Pest Recognition By Analyzing Ensemble Modeling. Solid State Technol. 2020, 63, 1696–1704. [Google Scholar]
  10. Yang, Z.; Li, W.; Li, M.; Yang, X. Automatic greenhouse pest recognition based on multiple color space features. Int. J. Agric. Biol. Eng. 2021, 14, 188–195. [Google Scholar] [CrossRef]
  11. Luo, Q.; Xin, W.; Qiming, X. Identification of pests and diseases of Dalbergia hainanensis based on EVI time series and classification of decision tree. In IOP Conference Series: Earth and Environmental Science; IOP Publishing: Bristol, UK, 2017; Volume 69, p. 012162. [Google Scholar]
  12. Banlawe, I.A.P.; Cruz, J.C.D.; Gaspar, J.C.P.; Gutierrez, E.J.I. Decision tree learning algorithm and naïve Bayes classifier algorithm comparative classification for mango pulp weevil mating activity. In Proceedings of the 2021 IEEE International Conference on Automatic Control & Intelligent Systems (I2CACIS), Online, 26 June 2021; pp. 317–322. [Google Scholar]
  13. Sangeetha, T.; Lavanya, G.; Jeyabharathi, D.; Kumar, T.R.; Mythili, K. Detection of pest and disease in banana leaf using convolution Random Forest. Test Eng. Manag. 2020, 83, 3727–3735. [Google Scholar]
  14. Sharma, S.; Kumar, V.; Sood, S. Pest Detection Using Random Forest. In Proceedings of the 2023 International Conference on IoT, Communication and Automation Technology (ICICAT), Gorakhpur, India, 23–24 June 2023; pp. 1–8. [Google Scholar]
  15. Pusadan, M.Y.; Abdullah, A.I. k-Nearest Neighbor and Feature Extraction on Detection of Pest and Diseases of Cocoa. J. RESTI Rekayasa Sist. Dan Teknol. Inf. 2022, 6, 471–480. [Google Scholar]
  16. Li, Y.; Ercisli, S. Data-efficient crop pest recognition based on KNN distance entropy. Sustain. Comput. Inform. Syst. 2023, 38, 100860. [Google Scholar] [CrossRef]
  17. Resti, Y.; Irsan, C.; Putri, M.T.; Yani, I.; Ansyori, A.; Suprihatin, B. Identification of corn plant diseases and pests based on digital images using multinomial naïve bayes and k-nearest neighbor. Sci. Technol. Indones. 2022, 7, 29–35. [Google Scholar] [CrossRef]
  18. Chen, J.W.; Lin, W.J.; Cheng, H.J.; Hung, C.L.; Lin, C.Y.; Chen, S.P. A smartphone-based application for scale pest detection using multiple-object detection methods. Electronics 2021, 10, 372. [Google Scholar] [CrossRef]
  19. Süto, J. Embedded system-based sticky paper trap with deep learning-based insect-counting algorithm. Electronics 2021, 10, 1754. [Google Scholar] [CrossRef]
  20. Góral, P.; Pawłowski, P.; Piniarski, K.; Dąbrowski, A. Multi-Agent Vision System for Supporting Autonomous Orchard Spraying. Electronics 2024, 13, 494. [Google Scholar] [CrossRef]
  21. Krizhevsky, A.; Sutskever, I.; Hinton, G.E. ImageNet classification with deep convolutional neural networks. Commun. ACM 2017, 60, 84–90. [Google Scholar] [CrossRef]
  22. Szegedy, C.; Liu, W.; Jia, Y.; Sermanet, P.; Reed, S.; Anguelov, D.; Erhan, D.; Vanhoucke, V.; Rabinovich, A. Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 1–9. [Google Scholar]
  23. Iandola, F.N.; Han, S.; Moskewicz, M.W.; Ashraf, K.; Dally, W.J.; Keutzer, K. SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5 MB model size. arXiv 2016, arXiv:1602.07360. [Google Scholar]
  24. Howard, A.G.; Zhu, M.; Chen, B.; Kalenichenko, D.; Wang, W.; Weyand, T.; Andreetto, M.; Adam, H. Mobilenets: Efficient convolutional neural networks for mobile vision applications. arXiv 2017, arXiv:1704.04861. [Google Scholar]
  25. Wagle, S.A.; Varadarajan, V.; Kotecha, K. A new compact method based on a convolutional neural network for classification and validation of tomato plant disease. Electronics 2022, 11, 2994. [Google Scholar] [CrossRef]
  26. Yi, S.L.; Qin, S.L.; She, F.R.; Wang, T.W. RED-CNN: The multi-classification network for pulmonary diseases. Electronics 2022, 11, 2896. [Google Scholar] [CrossRef]
  27. Zhu, Z.; Wang, S.; Zhang, Y. ROENet: A ResNet-based output ensemble for malaria parasite classification. Electronics 2022, 11, 2040. [Google Scholar] [CrossRef] [PubMed]
  28. Fu’adah, Y.N.; Lim, K.M. Classification of Atrial Fibrillation and Congestive Heart Failure Using Convolutional Neural Network with Electrocardiogram. Electronics 2022, 11, 2456. [Google Scholar] [CrossRef]
  29. Rajeena P.P., F.; Orban, R.; Vadivel, K.S.; Subramanian, M.; Muthusamy, S.; Elminaam, D.S.A.; Nabil, A.; Abulaigh, L.; Ahmadi, M.; Ali, M.A. A novel method for the classification of butterfly species using pre-trained CNN models. Electronics 2022, 11, 2016. [Google Scholar] [CrossRef]
  30. Amin, R.; Reza, M.S.; Okuyama, Y.; Tomioka, Y.; Shin, J. A Fine-Tuned Hybrid Stacked CNN to Improve Bengali Handwritten Digit Recognition. Electronics 2023, 12, 3337. [Google Scholar] [CrossRef]
  31. Girshick, R.; Donahue, J.; Darrell, T.; Malik, J. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June 2014; pp. 580–587. [Google Scholar]
  32. Girshick, R. Fast r-cnn. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015; pp. 1440–1448. [Google Scholar]
  33. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster r-cnn: Towards real-time object detection with region proposal networks. In Proceedings of the Advances in Neural Information Processing Systems 28 (NIPS 2015), Montreal, QC, Canada, 7–12 December 2015. [Google Scholar]
  34. Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You only look once: Unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 779–788. [Google Scholar]
  35. Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.Y.; Berg, A.C. Ssd: Single shot multibox detector. In Proceedings of the Computer Vision—ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, 11–14 October 2016; Proceedings, Part I 14; Springer: Berlin/Heidelberg, Germany, 2016; pp. 21–37. [Google Scholar]
  36. Akhtar, M.J.; Mahum, R.; Butt, F.S.; Amin, R.; El-Sherbeeny, A.M.; Lee, S.M.; Shaikh, S. A robust framework for object detection in a traffic surveillance system. Electronics 2022, 11, 3425. [Google Scholar] [CrossRef]
  37. Cong, P.; Lv, K.; Feng, H.; Zhou, J. Improved yolov3 model for workpiece stud leakage detection. Electronics 2022, 11, 3430. [Google Scholar] [CrossRef]
  38. Amran, G.A.; Alsharam, M.S.; Blajam, A.O.A.; Hasan, A.A.; Alfaifi, M.Y.; Amran, M.H.; Gumaei, A.; Eldin, S.M. Brain tumor classification and detection using hybrid deep tumor network. Electronics 2022, 11, 3457. [Google Scholar] [CrossRef]
  39. Dai, J.; Li, T.; Xuan, Z.; Feng, Z. Automated defect analysis system for industrial computerized tomography images of solid rocket motor grains based on yolo-v4 model. Electronics 2022, 11, 3215. [Google Scholar] [CrossRef]
  40. Gu, Z.; Zhu, K.; You, S. YOLO-SSFS: A Method Combining SPD-Conv/STDL/IM-FPN/SIoU for Outdoor Small Target Vehicle Detection. Electronics 2023, 12, 3744. [Google Scholar] [CrossRef]
  41. Long, J.; Shelhamer, E.; Darrell, T. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 3431–3440. [Google Scholar]
  42. Ronneberger, O.; Fischer, P.; Brox, T. U-net: Convolutional networks for biomedical image segmentation. In Proceedings of the Medical Image Computing and Computer-Assisted Intervention—MICCAI 2015: 18th International Conference, Munich, Germany, 5–9 October 2015; Proceedings, Part III 18; Springer: Berlin/Heidelberg, Germany, 2015; pp. 234–241. [Google Scholar]
  43. Chen, L.C.; Papandreou, G.; Kokkinos, I.; Murphy, K.; Yuille, A.L. Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 40, 834–848. [Google Scholar] [CrossRef]
  44. Badrinarayanan, V.; Kendall, A.; Cipolla, R. Segnet: A deep convolutional encoder-decoder architecture for image segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 2481–2495. [Google Scholar] [CrossRef]
  45. He, K.; Gkioxari, G.; Dollár, P.; Girshick, R. Mask r-cnn. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2961–2969. [Google Scholar]
  46. Bhan, A.; Mangipudi, P.; Goyal, A. Deep Learning Approach for Automatic Segmentation and Functional Assessment of LV in Cardiac MRI. Electronics 2022, 11, 3594. [Google Scholar] [CrossRef]
  47. Gargari, M.S.; Seyedi, M.H.; Alilou, M. Segmentation of Retinal Blood Vessels Using U-Net++ Architecture and Disease Prediction. Electronics 2022, 11, 3516. [Google Scholar] [CrossRef]
  48. Yang, D.; Wang, C.; Cheng, C.; Pan, G.; Zhang, F. Semantic segmentation of side-scan sonar images with few samples. Electronics 2022, 11, 3002. [Google Scholar] [CrossRef]
  49. Xu, F.; Huang, J.; Wu, J.; Jiang, L. Active mask-box scoring r-cnn for sonar image instance segmentation. Electronics 2022, 11, 2048. [Google Scholar] [CrossRef]
  50. Xie, X.; Bai, L.; Huang, X. Real-time LiDAR point cloud semantic segmentation for autonomous driving. Electronics 2021, 11, 11. [Google Scholar] [CrossRef]
  51. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar]
  52. Bengio, Y.; Simard, P.; Frasconi, P. Learning long-term dependencies with gradient descent is difficult. IEEE Trans. Neural Netw. 1994, 5, 157–166. [Google Scholar] [CrossRef] [PubMed]
  53. Pascanu, R.; Mikolov, T.; Bengio, Y. On the difficulty of training recurrent neural networks. In Proceedings of the International Conference on Machine Learning, Atlanta, GA, USA, 16–21 June 2013; pp. 1310–1318. [Google Scholar]
  54. Srivastava, N.; Hinton, G.; Krizhevsky, A.; Sutskever, I.; Salakhutdinov, R. Dropout: A simple way to prevent neural networks from overfitting. J. Mach. Learn. Res. 2014, 15, 1929–1958. [Google Scholar]
  55. Ioffe, S.; Szegedy, C. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In Proceedings of the International Conference on Machine Learning, Lille, France, 6–11 July 2015; pp. 448–456. [Google Scholar]
  56. Nair, V.; Hinton, G.E. Rectified linear units improve restricted boltzmann machines. In Proceedings of the 27th International Conference on Machine Learning (ICML-10), Haifa, Israel, 21–24 June 2010; pp. 807–814. [Google Scholar]
  57. Glorot, X.; Bengio, Y. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, JMLR Workshop and Conference Proceedings, Sardinia, Italy, 13–15 May 2010; pp. 249–256. [Google Scholar]
  58. Glorot, X.; Bordes, A.; Bengio, Y. Deep sparse rectifier neural networks. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics. JMLR Workshop and Conference Proceedings, Fort Lauderdale, FL, USA, 11–13 April 2011; pp. 315–323. [Google Scholar]
  59. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. In Proceedings of the Advances in Neural Information Processing Systems 30 (NIPS 2017), Long Beach, CA, USA, 4–9 December 2017. [Google Scholar]
  60. Cho, K.; Van Merriënboer, B.; Gulcehre, C.; Bahdanau, D.; Bougares, F.; Schwenk, H.; Bengio, Y. Learning phrase representations using RNN encoder-decoder for statistical machine translation. arXiv 2014, arXiv:1406.1078. [Google Scholar]
  61. Sutskever, I.; Vinyals, O.; Le, Q.V. Sequence to sequence learning with neural networks. In Proceedings of the Advances in Neural Information Processing Systems 27 (NIPS 2014), Montreal, QC, Canada, 8–13 December 2014. [Google Scholar]
  62. Cho, K.; Van Merriënboer, B.; Bahdanau, D.; Bengio, Y. On the properties of neural machine translation: Encoder-decoder approaches. arXiv 2014, arXiv:1409.1259. [Google Scholar]
  63. Bahdanau, D.; Cho, K.; Bengio, Y. Neural machine translation by jointly learning to align and translate. arXiv 2014, arXiv:1409.0473. [Google Scholar]
  64. Luong, M.T.; Pham, H.; Manning, C.D. Effective approaches to attention-based neural machine translation. arXiv 2015, arXiv:1508.04025. [Google Scholar]
  65. Rumelhart, D.E.; Hinton, G.E.; Williams, R.J. Learning representations by back-propagating errors. Nature 1986, 323, 533–536. [Google Scholar] [CrossRef]
  66. Hochreiter, S.; Schmidhuber, J. Long short-term memory. Neural Comput. 1997, 9, 1735–1780. [Google Scholar] [CrossRef] [PubMed]
  67. Devlin, J.; Chang, M.W.; Lee, K.; Toutanova, K. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv 2018, arXiv:1810.04805. [Google Scholar]
  68. Lample, G.; Conneau, A.; Denoyer, L.; Ranzato, M. Unsupervised machine translation using monolingual corpora only. arXiv 2017, arXiv:1711.00043. [Google Scholar]
  69. See, A.; Liu, P.J.; Manning, C.D. Get to the point: Summarization with pointer-generator networks. arXiv 2017, arXiv:1704.04368. [Google Scholar]
  70. Liu, Y.; Lapata, M. Text summarization with pretrained encoders. arXiv 2019, arXiv:1908.08345. [Google Scholar]
  71. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv 2020, arXiv:2010.11929. [Google Scholar]
  72. Henaff, O. Data-efficient image recognition with contrastive predictive coding. In Proceedings of the International Conference on Machine Learning, Vienna, Austria, 12–18 July 2020; pp. 4182–4192. [Google Scholar]
  73. Touvron, H.; Cord, M.; Douze, M.; Massa, F.; Sablayrolles, A.; Jégou, H. Training data-efficient image transformers & distillation through attention. In Proceedings of the International Conference on Machine Learning, Online, 18–24 July 2021; pp. 10347–10357. [Google Scholar]
  74. Zhang, Q.; Yang, Y.B. Rest: An efficient transformer for visual recognition. Adv. Neural Inf. Process. Syst. 2021, 34, 15475–15485. [Google Scholar]
  75. Li, Y.; Yao, T.; Pan, Y.; Mei, T. Contextual transformer networks for visual recognition. IEEE Trans. Pattern Anal. Mach. Intell. 2022, 45, 1489–1500. [Google Scholar] [CrossRef] [PubMed]
  76. Carion, N.; Massa, F.; Synnaeve, G.; Usunier, N.; Kirillov, A.; Zagoruyko, S. End-to-end object detection with transformers. In Proceedings of the European Conference on Computer Vision 2020; Springer: Berlin/Heidelberg, Germany, 2020; pp. 213–229. [Google Scholar]
  77. Sun, Z.; Cao, S.; Yang, Y.; Kitani, K.M. Rethinking transformer-based set prediction for object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Online, 11–17 October 2021; pp. 3611–3620. [Google Scholar]
  78. Li, Y.; Mao, H.; Girshick, R.; He, K. Exploring plain vision transformer backbones for object detection. In Proceedings of the European Conference on Computer Vision 2022; Springer: Berlin/Heidelberg, Germany, 2022; pp. 280–296. [Google Scholar]
  79. Xia, L.; Cao, S.; Cheng, Y.; Niu, L.; Zhang, J.; Bao, H. Rotating Object Detection for Cranes in Transmission Line Scenarios. Electronics 2023, 12, 5046. [Google Scholar] [CrossRef]
  80. Huo, L.; Guo, K.; Wang, W. An Adaptive Multi-Content Complementary Network for Salient Object Detection. Electronics 2023, 12, 4600. [Google Scholar] [CrossRef]
  81. Wang, Y.; Xu, Z.; Wang, X.; Shen, C.; Cheng, B.; Shen, H.; Xia, H. End-to-end video instance segmentation with transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Online, 19–25 June 2021; pp. 8741–8750. [Google Scholar]
  82. Zheng, S.; Lu, J.; Zhao, H.; Zhu, X.; Luo, Z.; Wang, Y.; Fu, Y.; Feng, J.; Xiang, T.; Torr, P.H.; et al. Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Online, 19–25 June 2021; pp. 6881–6890. [Google Scholar]
  83. Chen, J.; Lu, Y.; Yu, Q.; Luo, X.; Adeli, E.; Wang, Y.; Lu, L.; Yuille, A.L.; Zhou, Y. Transunet: Transformers make strong encoders for medical image segmentation. arXiv 2021, arXiv:2102.04306. [Google Scholar]
  84. Jiao, C.; Yang, T.; Yan, Y.; Yang, A. RFTNet: Region–Attention Fusion Network Combined with Dual-Branch Vision Transformer for Multimodal Brain Tumor Image Segmentation. Electronics 2023, 13, 77. [Google Scholar] [CrossRef]
  85. Baek, J.H.; Lee, H.K.; Choo, H.G.; Jung, S.h.; Koh, Y.J. Center-Guided Transformer for Panoptic Segmentation. Electronics 2023, 12, 4801. [Google Scholar] [CrossRef]
  86. Arnab, A.; Dehghani, M.; Heigold, G.; Sun, C.; Lučić, M.; Schmid, C. Vivit: A video vision transformer. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Online, 11–17 October 2021; pp. 6836–6846. [Google Scholar]
  87. Neimark, D.; Bar, O.; Zohar, M.; Asselmann, D. Video transformer network. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Online, 11–17 October 2021; pp. 3163–3172. [Google Scholar]
  88. Yang, J.; Dong, X.; Liu, L.; Zhang, C.; Shen, J.; Yu, D. Recurring the transformer for video action recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 19–24 June 2022; pp. 14063–14073. [Google Scholar]
  89. Ranasinghe, K.; Naseer, M.; Khan, S.; Khan, F.S.; Ryoo, M.S. Self-supervised video transformer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 19–24 June 2022; pp. 2874–2884. [Google Scholar]
  90. Liang, J.; Cao, J.; Fan, Y.; Zhang, K.; Ranjan, R.; Li, Y.; Timofte, R.; Van Gool, L. Vrt: A video restoration transformer. IEEE Trans. Image Process. 2024, 33, 2171–2182. [Google Scholar] [CrossRef]
  91. Wu, X.; Zhan, C.; Lai, Y.K.; Cheng, M.M.; Yang, J. Ip102: A large-scale benchmark dataset for insect pest recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 16–20 June 2019; pp. 8787–8796. [Google Scholar]
  92. Xie, C.; Wang, R.; Zhang, J.; Chen, P.; Dong, W.; Li, R.; Chen, T.; Chen, H. Multi-level learning features for automatic classification of field crop pests. Comput. Electron. Agric. 2018, 152, 233–241. [Google Scholar] [CrossRef]
  93. Wang, Q.J.; Zhang, S.Y.; Dong, S.F.; Zhang, G.C.; Yang, J.; Li, R.; Wang, H.Q. Pest24: A large-scale very small object data set of agricultural pests for multi-target detection. Comput. Electron. Agric. 2020, 175, 105585. [Google Scholar] [CrossRef]
  94. Nanni, L.; Manfè, A.; Maguolo, G.; Lumini, A.; Brahnam, S. High performing ensemble of convolutional neural networks for insect pest image detection. Ecol. Inform. 2022, 67, 101515. [Google Scholar] [CrossRef]
  95. Chen, M.; Chen, Y.; Guo, M.; Wang, J. Pest Detection and Identification Guided by Feature Maps. In Proceedings of the 2023 Twelfth International Conference on Image Processing Theory, Tools and Applications (IPTA), Paris, France, 16–19 October 2023; pp. 1–6. [Google Scholar]
  96. Yang, S.; Xing, Z.; Wang, H.; Dong, X.; Gao, X.; Liu, Z.; Zhang, X.; Li, S.; Zhao, Y. Maize-YOLO: A new high-precision and real-time method for maize pest detection. Insects 2023, 14, 278. [Google Scholar] [CrossRef] [PubMed]
  97. Tang, Z.; Chen, Z.; Qi, F.; Zhang, L.; Chen, S. Pest-YOLO: Deep image mining and multi-feature fusion for real-time agriculture pest detection. In Proceedings of the 2021 IEEE International Conference on Data Mining (ICDM), Auckland, New Zealand, 7–10 December 2021; pp. 1348–1353. [Google Scholar]
  98. Tang, Z.; Lu, J.; Chen, Z.; Qi, F.; Zhang, L. Improved Pest-YOLO: Real-time pest detection based on efficient channel attention mechanism and transformer encoder. Ecol. Inform. 2023, 78, 102340. [Google Scholar] [CrossRef]
  99. Qi, F.; Wang, Y.; Tang, Z.; Chen, S. Real-time and effective detection of agricultural pest using an improved YOLOv5 network. J. Real-Time Image Process. 2023, 20, 33. [Google Scholar] [CrossRef]
  100. Wang, F.; Wang, R.; Xie, C.; Yang, P.; Liu, L. Fusing multi-scale context-aware information representation for automatic in-field pest detection and recognition. Comput. Electron. Agric. 2020, 169, 105222. [Google Scholar] [CrossRef]
  101. Jiao, L.; Li, G.; Chen, P.; Wang, R.; Du, J.; Liu, H.; Dong, S. Global context-aware-based deformable residual network module for precise pest recognition and detection. Front. Plant Sci. 2022, 13, 895944. [Google Scholar] [CrossRef] [PubMed]
  102. Dai, M.; Dorjoy, M.M.H.; Miao, H.; Zhang, S. A new pest detection method based on improved YOLOv5m. Insects 2023, 14, 54. [Google Scholar] [CrossRef]
  103. Yang, Z.; Feng, H.; Ruan, Y.; Weng, X. Tea tree pest detection algorithm based on improved Yolov7-Tiny. Agriculture 2023, 13, 1031. [Google Scholar] [CrossRef]
  104. Tian, Y.; Wang, S.; Li, E.; Yang, G.; Liang, Z.; Tan, M. MD-YOLO: Multi-scale Dense YOLO for small target pest detection. Comput. Electron. Agric. 2023, 213, 108233. [Google Scholar] [CrossRef]
  105. Chu, J.; Li, Y.; Feng, H.; Weng, X.; Ruan, Y. Research on multi-scale pest detection and identification method in granary based on improved YOLOv5. Agriculture 2023, 13, 364. [Google Scholar] [CrossRef]
  106. Li, K.; Wang, J.; Jalil, H.; Wang, H. A fast and lightweight detection algorithm for passion fruit pests based on improved YOLOv5. Comput. Electron. Agric. 2023, 204, 107534. [Google Scholar] [CrossRef]
  107. Chen, J.; Chen, W.; Nanehkaran, Y.; Suzauddola, M. MAM-IncNet: An end-to-end deep learning detector for Camellia pest recognition. Multimed. Tools Appl. 2024, 83, 31379–31394. [Google Scholar] [CrossRef]
  108. Ye, R.; Gao, Q.; Qian, Y.; Sun, J.; Li, T. Improved Yolov8 and Sahi Model for the Collaborative Detection of Small Targets at the Micro Scale: A Case Study of Pest Detection in Tea. Agronomy 2024, 14, 1034. [Google Scholar] [CrossRef]
  109. Liu, W.; Wu, G.; Ren, F.; Kang, X. DFF-ResNet: An insect pest recognition model based on residual networks. Big Data Min. Anal. 2020, 3, 300–310. [Google Scholar] [CrossRef]
  110. Ayan, E.; Erbay, H.; Varçın, F. Crop pest classification with a genetic algorithm-based weighted ensemble of deep convolutional neural networks. Comput. Electron. Agric. 2020, 179, 105809. [Google Scholar] [CrossRef]
  111. Feng, F.; Dong, H.; Zhang, Y.; Zhang, Y.; Li, B. Ms-aln: Multiscale attention learning network for pest recognition. IEEE Access 2022, 10, 40888–40898. [Google Scholar] [CrossRef]
  112. Zheng, T.; Yang, X.; Lv, J.; Li, M.; Wang, S.; Li, W. An efficient mobile model for insect image classification in the field pest management. Eng. Sci. Technol. Int. J. 2023, 39, 101335. [Google Scholar] [CrossRef]
  113. Devi, R.; Kumar, V.; Sivakumar, P. EfficientNetV2 Model for Plant Disease Classification and Pest Recognition. Comput. Syst. Sci. Eng. 2023, 45, 2249–2263. [Google Scholar] [CrossRef]
  114. Anwar, Z.; Masood, S. Exploring Deep Ensemble Model for Insect and Pest Detection from Images. Procedia Comput. Sci. 2023, 218, 2328–2337. [Google Scholar] [CrossRef]
  115. Chen, Y.; Chen, M.; Guo, M.; Wang, J.; Zheng, N. Pest recognition based on multi-image feature localization and adaptive filtering fusion. Front. Plant Sci. 2023, 14, 1282212. [Google Scholar] [CrossRef]
  116. Nandhini, C.; Brindha, M. Visual regenerative fusion network for pest recognition. Neural Comput. Appl. 2024, 36, 2867–2882. [Google Scholar] [CrossRef]
  117. Li, Y.; Wang, H.; Dang, L.M.; Sadeghi-Niaraki, A.; Moon, H. Crop pest recognition in natural scenes using convolutional neural networks. Comput. Electron. Agric. 2020, 169, 105174. [Google Scholar] [CrossRef]
  118. Chen, J.; Chen, W.; Zeb, A.; Zhang, D.; Nanehkaran, Y.A. Crop pest recognition using attention-embedded lightweight network under field conditions. Appl. Entomol. Zool. 2021, 56, 427–442. [Google Scholar] [CrossRef]
  119. Xu, C.; Yu, C.; Zhang, S.; Wang, X. Multi-scale convolution-capsule network for crop insect pest recognition. Electronics 2022, 11, 1630. [Google Scholar] [CrossRef]
  120. Zhao, S.; Liu, J.; Bai, Z.; Hu, C.; Jin, Y. Crop pest recognition in real agricultural environment using convolutional neural networks by a parallel attention mechanism. Front. Plant Sci. 2022, 13, 839572. [Google Scholar] [CrossRef] [PubMed]
  121. Dai, G.; Fan, J.; Dewi, C. ITF-WPI: Image and text based cross-modal feature fusion model for wolfberry pest recognition. Comput. Electron. Agric. 2023, 212, 108129. [Google Scholar] [CrossRef]
  122. Zhang, Y.; Chen, L.; Yuan, Y. Multimodal fine-grained transformer model for pest recognition. Electronics 2023, 12, 2620. [Google Scholar] [CrossRef]
  123. Hassan, S.M.; Maji, A.K. Pest Identification based on fusion of Self-Attention with ResNet. IEEE Access 2024, 12, 6036–6050. [Google Scholar] [CrossRef]
Figure 2. The architecture of a CNN.
Figure 3. The architecture of a residual block in ResNet.
Figure 4. The forward propagation architecture of YOLO.
Figure 5. The forward propagation architecture of the Transformer.
Figure 6. Overview of the Vision Transformer model.
Figure 7. Overview of the Detection Transformer model.
Figure 8. The different appearances of Chlumetia transversa in different life stages. (a) Larval stage. (b) Pupal stage. (c) Adult stage.
Figure 9. The indistinguishably similar appearances and behaviors of pests from four different categories. (a) Black cutworm. (b) Flax budworm. (c) Prodenia litura. (d) Chlumetia transversa.
Figure 10. The different traces left by pests from four categories, shown without the pests themselves. (a) Traces of wheat sawflies. (b) Traces of beet armyworms. (c) Traces of Pseudococcus comstocki kuwana. (d) Traces of Chrysomphalus aonidum.
Figure 13. Comparison of mAP metrics for deep learning pest detection algorithms on different datasets. (a) Algorithms tested on the IP102 dataset. (b) Algorithms tested on the Pest24 dataset.
Figure 14. Comparison of mAP metrics for deep learning pest detection algorithms tested on different self-collected datasets.
Figure 15. Comparison of Accuracy metrics for deep learning pest recognition algorithms tested on the IP102 dataset.
Figure 16. Comparison of Accuracy metrics for deep learning pest recognition algorithms tested on different self-collected datasets.
Table 1. Strengths and weaknesses of CNN and Transformer architectures.

|            | CNN                                   | Transformer                             |
|------------|---------------------------------------|-----------------------------------------|
| Strengths  | Local connectivity and weight sharing | Global feature extraction               |
|            | Translation invariance                | Parallel computing                      |
|            | Mature models and techniques          | Unified architecture                    |
| Weaknesses | Limited receptive field               | High computational and memory costs     |
|            | More complexity for more layers       | Requires large-scale data for training  |
|            | Limited multi-scale feature handling  | Lack of intrinsic spatial encoding      |
Table 2. The performance comparison of 15 pest detection algorithms in recent years.

| Paper | Year | Methods | Dataset | Metrics |
|-------|------|---------|---------|---------|
| [94]  | 2022 | Ensemble Learning | IP102 Dataset | mAP: 74.1%; F1-Score: 73.0% |
| [95]  | 2023 | ResNet-50; CAM | IP102 Dataset | mAP: 74.27% |
| [96]  | 2023 | Improved YOLOv7; CSPResNeXt-50; VoVGSCSP | IP102 Dataset | mAP: 76.3%; Precision: 73.1%; Recall: 77.3%; F1-Score: 75.2% |
| [97]  | 2021 | ResNet-50; Feature Fusion; YOLOv3 | Pest24 Dataset | mAP: 71.6%; Recall: 83.5% |
| [98]  | 2022 | ResNet-50; Transformer; Feature Fusion; YOLOv3 | Pest24 Dataset | mAP: 73.4%; Recall: 83.9% |
| [99]  | 2023 | GhostNet; Improved YOLOv5; Channel Attention | Pest24 Dataset | mAP: 74.1% |
| [94]  | 2022 | Ensemble Learning | D0 Dataset | mAP: 99.8%; F1-Score: 99.7% |
| [100] | 2020 | ResNet-50; Context-Aware Attention | 17,192 collected images | mAP: 74.3% |
| [101] | 2022 | DBR-Net; FPN; Residual Learning | 24,412 collected images | mAP: 77.8% |
| [102] | 2023 | Improved YOLOv5m; Transformer | 1309 collected images | mAP: 96.4%; Precision: 95.7%; Recall: 93.1%; F1-Score: 94.38% |
| [103] | 2023 | Improved YOLOv7-Tiny; Biformer Dynamic Attention | 782 collected images | mAP: 93.23%; Recall: 90.81% |
| [104] | 2023 | DenseNet; Adaptive Attention; YOLOv3 | 289 collected images | mAP: 86.2%; F1-Score: 79.1% |
| [105] | 2023 | Improved YOLOv5; Channel Attention; FPN | 5231 collected images | mAP: 98.2%; Accuracy: 97.20%; Recall: 96.85% |
| [106] | 2023 | Improved YOLOv5; Adaptive Attention | 6000 collected images | mAP: 96.51%; F1-Score: 96.54% |
| [107] | 2024 | VGG-16; Channel and Spatial Attention | 1035 collected images | mAP: 95.87%; Recall: 81.44%; Precision: 97.53%; F1-Score: 88.76% |
| [108] | 2024 | Improved YOLOv8; Transformer | 2864 collected images | mAP: 98.17%; Precision: 96.32%; Recall: 97.95%; F1-Score: 97.13% |
Table 3. Pros and cons of pest detection algorithms tested on the IP102 dataset.

| Paper | Pros | Cons |
|-------|------|------|
| [94] | High training efficiency; High robustness | High computational complexity; Long training time; Complex management of models |
| [95] | Strong algorithm generality; Low dependency on annotated data | High demand for computational resources |
| [96] | High detection accuracy; High computational efficiency | Poor generalization ability; Complex model |
Table 4. Pros and cons of pest detection algorithms tested on the Pest24 dataset.

| Paper | Pros | Cons |
|-------|------|------|
| [97] | Excellent performance on small objects | Complex model; High data dependency |
| [98] | Excellent performance on small objects; Strong global feature capture ability | High computational complexity; Large model size |
| [99] | Good real-time performance; High accuracy; Small model footprint | Poor generalization ability; High demand for computational resources |
Table 5. Pros and cons of pest detection algorithms tested on different self-collected datasets.

| Paper | Pros | Cons |
|-------|------|------|
| [100] | Multi-scale information fusion; Suitable for large-scale datasets | Complex practical application; Long training time |
| [101] | Good real-time performance; Suitable for large-scale datasets | High data dependency; High demand for computational resources |
| [102] | High detection accuracy; Good robustness; High computational efficiency | Complex model |
| [103] | Strong feature extraction capability; High detection accuracy | Poor generalization ability; Complex model |
| [104] | Multi-scale detection capability; Good real-time performance | Long training time; Poor generalization ability; Complex model |
| [105] | High detection accuracy; Excellent performance on small objects; Good robustness | Complex model |
| [106] | High detection accuracy; Good real-time performance; Small model footprint | Complex model |
| [107] | High detection accuracy | Poor generalization ability; Complex model |
| [108] | Strong feature extraction capability; High detection accuracy; Good real-time performance | Long training time; Complex model; High demand for computational resources |
Table 6. The performance comparison of 15 pest recognition algorithms in recent years.

| Paper | Year | Method | Dataset | Metrics |
|-------|------|--------|---------|---------|
| [109] | 2020 | ResNet; Feature Fusion | IP102 Dataset | Accuracy: 55.43%; F1-Score: 54.18% |
| [110] | 2020 | Ensemble Learning | IP102 Dataset | Accuracy: 67.13%; Precision: 67.17%; Recall: 67.13%; F1-Score: 65.76% |
| [111] | 2022 | ResNet-50; Attention | IP102 Dataset | Accuracy: 74.61%; F1-Score: 67.83% |
| [112] | 2023 | EfficientNetV2; Coordinate Attention; Feature Fusion | IP102 Dataset | Accuracy: 73.7% |
| [113] | 2023 | EfficientNetV2; Transfer Learning | IP102 Dataset | Accuracy: 80.1% |
| [114] | 2023 | Ensemble Learning; Transfer Learning | IP102 Dataset | Accuracy: 82.5% |
| [115] | 2023 | ResNet-50; Multi-Image Fusion | IP102 Dataset | Accuracy: 96.1%; F1-Score: 95.9% |
| [116] | 2024 | ResNet; Feature Fusion | IP102 Dataset | Accuracy: 68.34%; Precision: 68.37%; Recall: 68.33%; F1-Score: 68.34% |
| [110] | 2020 | Ensemble Learning | D0 Dataset | Accuracy: 98.81%; Precision: 98.88%; Recall: 98.81%; F1-Score: 98.81% |
| [115] | 2023 | ResNet-50; Multi-Image Fusion | D0 Dataset | Accuracy: 100%; F1-Score: 100% |
| [116] | 2024 | ResNet; Feature Fusion | D0 Dataset | Accuracy: 99.12%; Precision: 99.84%; Recall: 99.12%; F1-Score: 99.13% |
| [117] | 2020 | GoogleNet | 5629 collected images | Accuracy: 96.67% |
| [118] | 2021 | MobileNetv2; CAM | | Accuracy: 99.14% |
| [119] | 2022 | CNN; CapsNet | Subset of IP102 Dataset | Accuracy: 91.4% |
| [120] | 2022 | ResNet-50; Parallel Attention | 5245 collected images | Accuracy: 98.17% |
| [121] | 2022 | Transformer; Cross-modal Feature Fusion | 10,598 collected images | Accuracy: 97.98% |
| [122] | 2023 | Transformer; Cross-modal Feature Fusion | 1902 collected images | Accuracy: 98.12%; Precision: 99.07%; Recall: 98.56%; F1-Score: 98.50% |
| [123] | 2024 | ResNet-50 | 3150 collected images | Accuracy: 99.80% |
Table 7. Pros and cons of pest recognition algorithms tested on the IP102 dataset.

| Paper | Pros | Cons |
|-------|------|------|
| [109] | Strong generalization ability | Low recognition accuracy; High demand for computational resources |
| [110] | High recognition accuracy | High computational complexity; Long training time; Complex management of models |
| [111] | Strong feature extraction capability; High recognition accuracy | Large model footprint; Long training time; Complex model |
| [112] | Small model footprint; Strong generalization ability; High recognition accuracy | Long training time; Complex model |
| [113] | Strong model learning capability; High recognition accuracy | Static network topology; High demand for computational resources |
| [114] | High recognition accuracy; Strong generalization ability; Good robustness | Static network topology; High computational complexity; Complex management of models |
| [115] | High recognition accuracy | Long training time; Complex model |
| [116] | High recognition accuracy; Good robustness | High computational complexity; Poor generalization ability |
Table 8. Pros and cons of pest recognition algorithms tested on different self-collected datasets.

| Paper | Pros | Cons |
|-------|------|------|
| [117] | High recognition accuracy; Good robustness | Complex model; High demand for computational resources |
| [118] | High recognition accuracy; Small model footprint | Poor generalization ability |
| [119] | Multi-scale feature extraction; High recognition accuracy | High training difficulty; Complex model |
| [120] | High recognition accuracy; Good real-time performance | Poor performance on small objects; Complex model |
| [121] | Cross-modal feature fusion; High recognition accuracy; Suitable for large-scale datasets | Complex practical application; Poor generalization ability; Complex model |
| [122] | Cross-modal feature fusion; High recognition accuracy | Complex practical application; Poor generalization ability; Complex model |
| [123] | Strong feature extraction capability; High recognition accuracy | High computational complexity; Complex model |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
