1. Introduction
With the continuous growth of the world population, food security is a prerequisite for the normal operation of society and production. As is well known, pest damage accompanies the entire growth cycle of crops and is one of the main causes of global crop losses [1]. Crop pests are numerous and diverse; their damage typically manifests in different forms, including injury to leaves, stems, flowers, and fruits, and may even lead to virus transmission. According to data from the Food and Agriculture Organization of the United Nations, plant diseases cause over USD 220 billion in losses to the global economy every year. In addition, up to 40% of global crop production is lost annually due to pests, resulting in losses of at least USD 70 billion [2]. Therefore, preventing and controlling pests can not only improve food quality but also reduce agricultural losses, which is crucial for food security and for stabilizing the agricultural economy [3]. Accurate identification of pest populations is the prerequisite for precise pest control.
Currently, most methods for identifying crop pests and diseases rely on manual observation or individual sample detection, which is time-consuming and inefficient. As shown in Figure 1, pest species are numerous, and the differences among them in field environments are subtle. The accuracy and efficiency of pest identification therefore depend mainly on the professional knowledge of agricultural experts. High-precision methods based on molecular detection are limited by centralized laboratories and expensive equipment [4], and they also rely on professional expertise. In addition, the similarity and diversity of pests make accurate visual classification very difficult: the appearance and characteristics of pests may change across the stages of their life cycle, and complex environmental factors further increase the difficulty of accurate identification.
In recent years, with the improvement of computer hardware performance, artificial intelligence (AI) technologies have achieved outstanding performance in computer vision [5], including object detection [6], medical image classification [7], lesion segmentation [8], and cell image analysis [9]. These methods not only achieve high accuracy but also reduce time and cost, providing a new way to analyze and identify crop pests.
As a representative AI technology, deep convolutional neural networks (CNNs) are inspired by the working mode of animal visual systems, in which neurons perform local, layered processing of the visual receptive field. A CNN mainly comprises the following components: convolutional layers, pooling layers, activation functions, and fully connected layers [10]. The main advantage of CNNs lies in their efficiency on grid-structured data: by sharing parameters and extracting features hierarchically, a CNN can capture both local and global patterns, which makes it perform well in image recognition and object detection [11]. CNNs have driven many important breakthroughs in computer vision.
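To make these components concrete, the following is a minimal sketch of a small CNN image classifier in PyTorch; the layer sizes and class count are illustrative only and do not correspond to any model discussed in this paper.

```python
import torch
import torch.nn as nn

class SmallCNN(nn.Module):
    """Minimal CNN: convolution -> activation -> pooling -> fully connected."""
    def __init__(self, num_classes: int = 40):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, padding=1),  # convolutional layer (shared weights)
            nn.ReLU(),                                   # activation function
            nn.MaxPool2d(2),                             # pooling layer (downsampling)
            nn.Conv2d(32, 64, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),                     # global pooling to a 64-dim vector
        )
        self.classifier = nn.Linear(64, num_classes)     # fully connected layer

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.classifier(self.features(x).flatten(1))

logits = SmallCNN()(torch.randn(1, 3, 224, 224))  # one RGB image -> class logits
```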
According to the survey [12], a depthwise separable CNN was trained on a public dataset consisting of 14 plant species and 38 categories of diseased and healthy plant leaves. This model not only reduces the number of parameters and the computational cost but also achieves an accuracy of 98.42%. Naik et al. [13] used 12 different pre-trained deep learning networks to identify five leaf diseases, including leaf bending, gemini virus, Cercospora leaf spot disease, yellow leaf disease, and upper leaf bending disease. Among them, the proposed SECNN achieved optimal accuracies of 99.12% and 98.63% with and without enhancement, respectively. In addition, a Faster R-CNN was proposed to detect rice leaf diseases [14]; it was effective in the automatic diagnosis of three distinct rice leaf diseases, namely rice blast, brown spot, and hispa, with accuracy rates of 98.09%, 98.85%, and 99.17%, respectively, and it recognized healthy rice leaves with an accuracy of 99.25%. These methods have advanced the study of plant and crop disease classification to some extent, but they show strong performance only on a single dataset or with small sample sizes; the generalization ability of the models still needs to be verified.
In recent years, to address the problem of limited sample size, many researchers have applied transfer learning to the study of crop pests [15,16]. This type of method typically uses the ImageNet dataset for pre-training and then transfers the pre-trained weights to another task. For example, Paymode et al. [17] used a VGG-based transfer learning method to predict the types of diseases and pests in early grape and tomato leaves, achieving an accuracy of 98.40% for grapes and 95.71% for tomatoes. Thangaraj et al. [18] proposed a deep CNN model to identify tomato leaf disease. Krishnamoorthy et al. [19] utilized the InceptionResNetV2 model to identify pests in rice leaf images, achieving an accuracy of 95.67%. Furthermore, residual and attention mechanisms have been used for the detection and classification of plant and agricultural pest diseases [20]. Liu et al. presented a deep CNN for the visual localization and classification of pests in paddy field images [21]. An unsupervised CNN method was developed for the classification of pest species in field crops such as corn, soybean, wheat, and rapeseed [22]; it classified 40 types of crop pests by learning image features from a large number of unlabeled image patches. Too et al. used DenseNets to classify diseased and healthy leaf images of 14 plant species in the PlantVillage dataset [23]. Although the above methods and mechanisms can improve the model's attention to the target object, the inherent limitations of CNN feature extraction cannot be ignored; they stem mainly from the locality of convolution operations.
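As a hedged illustration of the ImageNet-based transfer learning recipe these works share (a generic sketch, not the exact setup of any cited paper), one can load pretrained weights, replace the final classifier, and fine-tune; the 40-class head below is an assumption for illustration.

```python
import torch.nn as nn
from torchvision import models

# Load a backbone pretrained on ImageNet and adapt it to a pest dataset.
model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)
model.fc = nn.Linear(model.fc.in_features, 40)  # e.g., 40 pest classes (illustrative)

# Optionally freeze the backbone and train only the new head at first.
for name, param in model.named_parameters():
    param.requires_grad = name.startswith("fc")
```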
The efficiency of the transformer model in processing long sequences has made it a powerful tool in image analysis, and more and more researchers are applying it to crop image studies. For example, Gu et al. [24] proposed a classification model that combines shifted-window transformer blocks with a lightweight CNN, applying the shifted-window transformer during feature extraction to obtain global feature information. Cheng et al. [25] introduced a hybrid CNN-transformer classification model, using the CNN and transformer components to extract spatial and channel feature information, respectively. Pereira et al. [26] improved the classification CNN structure with spatially adaptive recalibration (SegSE) blocks, which recalibrate the feature map by considering cross-channel information and spatial correlation, helping the model obtain both global and local feature information. Saranya et al. [27] proposed a modified ViT model (HPMA) that emphasizes discriminative features and suppresses irrelevant information for robustness; HPMA was verified on three pest datasets and achieved good performance.
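As a generic illustration of this hybrid CNN + transformer pattern (an assumed sketch, not the architecture of any specific cited model), CNN feature maps can be flattened into tokens and passed through a transformer encoder layer so that self-attention injects global context into the local features:

```python
import torch
import torch.nn as nn

# Hypothetical hybrid: a small CNN stem extracts local features,
# then a transformer encoder layer mixes them globally.
stem = nn.Sequential(nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.ReLU(),
                     nn.Conv2d(64, 64, 3, stride=2, padding=1), nn.ReLU())
encoder = nn.TransformerEncoderLayer(d_model=64, nhead=4, batch_first=True)

x = torch.randn(1, 3, 224, 224)
feats = stem(x)                              # (1, 64, 56, 56): local CNN features
tokens = feats.flatten(2).transpose(1, 2)    # (1, 3136, 64): one token per spatial site
globally_mixed = encoder(tokens)             # self-attention adds global context
```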
However, in pest classification tasks, most CNN-based studies are limited to controlled laboratory environments and cannot meet the requirements of pest identification in real outdoor settings. In addition, pest classification has characteristics that distinguish it from natural image object classification [28]. Specifically, unlike plant disease classification studies, our study focuses solely on the classification of agricultural pests rather than the identification of multiple types of plant diseases. Although many researchers use machine learning, deep learning, and other methods to analyze pest identification, published work commonly studies the classification of a single pest type rather than the joint classification of many pest species. Moreover, many pests are morphologically very similar, and multiple different pests may be present on the same crop, which makes accurate visual classification difficult. The appearance and characteristics of pests may also change across the stages of their life cycle, further increasing the difficulty of multi-class pest classification. This study addresses the classification of a wide variety of pest species across multiple crops and complex environments, which strengthens the practical significance of intelligent pest classification in agricultural scenarios.
In the present study, we propose a patch-based neural network (PMLPNet) for multi-class pest classification. The main contributions are as follows:
- (1)
PMLPNet is proposed for multi-class pest classification; it integrates local and global contextual semantic features through specially designed token-mixing and channel-mixing MLP structures (see the sketch after this list).
- (2)
The patch-based image input strategy not only improves the performance of PMLPNet but also provides a basis for image heterogeneity analysis.
- (3)
The GELU activation function improves PMLPNet's ability to fit complex data distributions.
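The following is a minimal sketch of the token-mixing/channel-mixing idea behind contribution (1): a generic MLP-Mixer-style block with GELU, written as an assumption about the general mechanism rather than the exact PMLPNet implementation. The token-mixing MLP operates across patches (global context), while the channel-mixing MLP operates within each patch (local features).

```python
import torch
import torch.nn as nn

class MixerBlock(nn.Module):
    """Generic mixer block: token-mixing MLP across patches, channel-mixing MLP per patch."""
    def __init__(self, num_patches: int, dim: int, hidden: int):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.token_mlp = nn.Sequential(      # mixes information ACROSS patches (global context)
            nn.Linear(num_patches, hidden), nn.GELU(), nn.Linear(hidden, num_patches))
        self.norm2 = nn.LayerNorm(dim)
        self.channel_mlp = nn.Sequential(    # mixes information WITHIN each patch (local features)
            nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (batch, patches, dim)
        x = x + self.token_mlp(self.norm1(x).transpose(1, 2)).transpose(1, 2)
        return x + self.channel_mlp(self.norm2(x))

tokens = torch.randn(2, 196, 256)        # e.g., 196 patches of a 224x224 image, 256-dim each
out = MixerBlock(196, 256, 512)(tokens)  # output keeps the shape (2, 196, 256)
```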
4. Discussion
We propose a patch-based multi-layer perceptron neural network (PMLPNet) for multi-class pest classification. PMLPNet integrates spatial and channel contextual semantic features, yielding high-quality, spatially localized features for the fully connected layers and activation function and helping the model classify pests accurately. We validated the proposed model on a multi-class pest dataset and compared it with other advanced models; PMLPNet achieved state-of-the-art performance. In addition, we visualized the heterogeneity of the extracted features and patches, verified the performance of the model, and analyzed the impact of image quality on model performance.
In this study, we extracted features from patch-level images and predicted pest species at the image level. The background and foreground differ across patches from the same image, so the feature information the model obtains from each patch also differs. We therefore believe that the patches differ in how closely they relate to the pest species. Taking an aphid image as an example, we selected nine patches from the image and analyzed their predicted pest-species probabilities to verify the heterogeneity between patches within the same image.
As shown in Figure 6, we input an aphid image into the PMLPNet model. After PMLPNet extracts the features, we display the prediction probability of each patch on the original image. The left side of Figure 6 shows the input image, and the right side shows the predicted probability for each patch. For a single aphid image, the prediction probabilities of different patches vary greatly, and this probability difference reflects differences in feature extraction. Our model can distinguish the target area from the background, but even among patches that contain the target, the prediction probabilities differ (0.94 vs. 0.83 vs. 0.75), mainly because of the magnitude of the difference between aphid pixels and the surrounding background pixels. When aphids differ markedly from the surrounding background, the model can accurately determine that the target in the patch is an aphid (0.94). However, when the aphid pixels overlap with or resemble the background pixels, the model's recognition accuracy is reduced. Furthermore, if the aphid pixels occupy only a small spatial proportion and similar background pixels are present, the recognition accuracy drops sharply (0.75). For these reasons, the complexity of the image background and the size of the pests are factors that affect the model's pest classification performance.
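A hedged sketch of this kind of per-patch probability analysis follows; it assumes a hypothetical trained classifier `model` that accepts single resized patches and returns class logits (PMLPNet's actual interface may differ), and the 3x3 grid and class index are illustrative.

```python
import torch
import torch.nn.functional as F

def patch_probabilities(image: torch.Tensor, model, grid: int = 3, cls: int = 0):
    """Split one image tensor (C, H, W) into a grid of patches and score each for class `cls`."""
    c, h, w = image.shape
    ph, pw = h // grid, w // grid
    probs = []
    for i in range(grid):
        for j in range(grid):
            crop = image[:, i*ph:(i+1)*ph, j*pw:(j+1)*pw].unsqueeze(0)
            crop = F.interpolate(crop, size=(h, w), mode="bilinear", align_corners=False)
            with torch.no_grad():
                p = F.softmax(model(crop), dim=1)[0, cls].item()  # e.g., probability of "aphid"
            probs.append(p)
    return probs  # nine values; their spread reflects patch heterogeneity
```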
As shown in Figure 7, we performed a statistical analysis of the loss and accuracy of the nine comparison methods during training and plotted the resulting curves. We trained all comparison methods for 120 epochs; as the epochs increase, the loss values of all models decrease while their accuracy improves. As shown in Figure 7a, at epoch 100 most models reach their lowest loss, except VGG and ResNet. As shown in Figure 7b, most models also achieve their highest accuracy there, except VGG, ResNet, and DenseNet. From Figure 7a,b, it can be concluded that most models reach their optimal state at epoch = 100; training beyond 100 epochs increases the likelihood of overfitting.
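Consistent with this observation, one common practice (a sketch of the general recipe, not necessarily the training script used here) is to track validation loss each epoch and keep the best checkpoint rather than the final one; `train_one_epoch` and `evaluate` below are hypothetical helpers.

```python
import copy

best_loss, best_state = float("inf"), None
for epoch in range(120):                      # the 120-epoch budget used in Figure 7
    train_one_epoch(model, train_loader)      # hypothetical helpers, not from the paper
    val_loss = evaluate(model, val_loader)
    if val_loss < best_loss:                  # most models bottom out near epoch 100
        best_loss = val_loss
        best_state = copy.deepcopy(model.state_dict())
model.load_state_dict(best_state)             # guards against overfitting past epoch 100
```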
In addition, we performed a statistical analysis of the accuracy of the proposed PMLPNet in predicting the 40 types of pests in the testing set. Specifically, we counted the number of samples of each pest type in the testing set and the number correctly predicted by the model, and computed the prediction accuracy for each pest type (as shown in Figure 8).
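This per-class accuracy statistic amounts to a simple counting pass over the testing set; a minimal sketch follows, with `labels` and `preds` assumed to be parallel lists of ground-truth and predicted class ids (hypothetical names, not from the paper's code).

```python
from collections import Counter

def per_class_accuracy(labels, preds):
    """labels/preds: parallel lists of class ids over the testing set."""
    totals = Counter(labels)                                       # number of each pest type
    correct = Counter(l for l, p in zip(labels, preds) if l == p)  # correctly predicted count
    return {c: correct[c] / totals[c] for c in totals}             # accuracy per pest class
```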
Figure 8 is a combined figure containing a bar chart and a line chart. The bar chart shows the number of each pest type in the testing set and the number correctly predicted, and the line chart shows the per-type prediction accuracy. As shown by the red line in Figure 8, the prediction accuracy for most pests exceeds 90%, and for some it even reaches 100%. Through image analysis, we found that these pests appear large in the images and differ markedly in color from their backgrounds. The red line also shows obvious dips corresponding to lower accuracy, meaning that our model has a higher error rate for those pests. Our analysis identified three main reasons: (1) the small size of the pests in the images; (2) the complex background of the images in which the pests appear; and (3) insufficient lighting in the images. These issues affect the performance of our model and reveal its limitations; we will focus on them in future research.
Our research has some limitations. First, there is a tension between the limited sample size and the training data requirements of deep learning models; although data augmentation alleviates this tension, collecting more samples is the ultimate solution. In addition, the mixer block is a time-consuming module, so the training time of the model is relatively long. Second, the heterogeneity between the source and target data limits the performance of transfer learning methods. Finally, the classification accuracy for images with complex backgrounds, insufficient lighting, and scattered or blurry targets still needs to be improved.