Article

Multi-Scale and Multi-Factor ViT Attention Model for Classification and Detection of Pest and Disease in Agriculture

by Mingyao Xie and Ning Ye *
College of Information Science and Technology, Nanjing Forestry University, Nanjing 210037, China
* Author to whom correspondence should be addressed.
Appl. Sci. 2024, 14(13), 5797; https://doi.org/10.3390/app14135797
Submission received: 9 May 2024 / Revised: 1 July 2024 / Accepted: 2 July 2024 / Published: 2 July 2024

Abstract:
Agriculture has a crucial impact on the economic, ecological, and social development of the world. More rapid and precise prevention and control work, especially accurate classification and detection, is required due to the increasing severity of agricultural pests and diseases. However, image classification and detection results are often unsatisfactory because of the limited volume of image data that can be acquired and the wide range of factors influencing pests and diseases. In order to solve these problems, the vision transformer (ViT) model is improved, and a multi-scale and multi-factor ViT attention model (SFA-ViT) is proposed in this paper. Data augmentation considering multiple influencing factors is implemented in SFA-ViT to mitigate the impact of insufficient experimental data. Meanwhile, SFA-ViT optimizes the ViT model from a multi-scale perspective and encourages the model to learn features from fine-grained to coarse-grained during the classification task. Further, a detection model based on the self-attention mechanism of the multi-scale ViT is constructed to achieve accurate localization of pests and diseases. Finally, experimental validation of the model, based on the IP102 and Plant Village datasets, is carried out. The results indicate that the various components of SFA-ViT effectively enhance the final classification and detection outcomes, and our model outperforms the current models significantly.

1. Introduction

Pests and diseases, as major calamities affecting the development of agriculture, frequently inflict substantial economic losses due to their diversity and propensity for rapid outbreak [1]. According to a report by the Food and Agriculture Organization (FAO) of the United Nations, pests and diseases globally result in an annual loss of approximately 20% to 40% of agricultural crops, posing a severe threat to the security of staple food crops such as maize, wheat, and rice [2]. In the face of increasingly severe pest and disease threats, it is critically important to undertake control efforts based on classification and detection work.
Traditional methods for the classification and detection of pests and diseases primarily rely on visual observation, manual sampling, and laboratory analysis, which are typically time-consuming and susceptible to the limitations imposed by human experience [3]. With the advancement of computer and image acquisition technologies, machine learning-based techniques for pest and disease classification and detection have garnered considerable attention from numerous scholars [4,5,6,7]. Notably, deep learning technologies, owing to their potent capability for automatic feature extraction, high adaptability to large datasets, and superior robustness, have demonstrated more accurate and efficient image processing performance [8,9]. The basic concept is predicated on the utilization of neural networks, intricately composed of convolutional layers, pooling layers, and fully connected layers, for the purposes of data analysis and feature learning. These networks possess the ability to extract comprehensive features from images and, through the amalgamation of these features, to procure higher-level information, which consequently facilitates the classification and detection of targets within images [10]. In the realm of pest and disease identification research, it is imperative to comprehensively extract information pertaining to color, texture, and morphology. Such information is often susceptible to constraints imposed by factors such as the stochastic distribution of lesions, the diversity in symptoms, and complex environmental conditions (e.g., lighting, camera angle, etc.). Upon alterations in the environment, the characteristics utilized for identification and detection may deviate from the established patterns, consequently precipitating a decline in precision [11,12].
Some scholars conduct relevant research by combining the distinctive features of agricultural pests and diseases with deep learning technology. Methods such as optimization of network architecture [13,14], refinement of loss functions [15], and integration of transfer learning [16] have been sequentially proposed in classification tasks, while approaches based on R-CNN [17,18], SSD [19], and YOLO [20,21] have also been subjected to experimentation in object detection tasks. However, these methodologies are frequently predicated upon niche datasets and may not be applicable to real-world scenarios characterized by complex background noise and encompassing a multitude of factors. Moreover, the occurrence scenarios of pests and diseases in agriculture are characterized by complex environments, where the outcomes of image data recognition and detection are susceptible to interference from background factors. This necessitates models with enhanced self-attention capabilities to improve the degree of attention in priority regions, an area in which the ViT model has demonstrated significant advantages [22,23]. This model applies the traditional transformer architecture to visual data, breaking down images into fixed-size patches and projecting these into a one-dimensional sequence for processing, thereby enabling the capture of the rich features and relationships inherent within the images [24]. However, the ViT exhibits certain deficiencies in classification and detection tasks, particularly due to its partitioning of images into fixed-size patches, which may constrain the model’s capacity to capture features across varying scales. Fixed-size patches may not furnish sufficient information for small targets or fine-grained features, whereas redundancy of information will occur for larger targets.
As a result, a multi-scale and multi-factor ViT attention model is proposed to surmount the limitations inherent in current classification and detection models. This approach leverages data augmentation techniques to integrate a broader spectrum of influencing factors from agricultural pest and disease image datasets and processes images at various dimensions, thereby facilitating the ViT in extracting information characteristics across multiple scales. Furthermore, the model uses data distillation techniques to combine feature information across multiple scales for the classification task, and utilizes attention mechanisms at multiple scales in conjunction with compact networks to accomplish the task of object detection. Ultimately, the efficacy and rationality of the model are corroborated through experimental validation.

2. Methods

2.1. Overview of SFA-ViT

SFA-ViT is proposed in this paper in response to the limitations of the ViT model in capturing features at varying scales. The overview of the SFA-ViT model is shown in Figure 1, which includes three core sub-models: a data augmentation model based on multiple factors, a classification model based on multi-scale ViT, and a detection model based on multi-scale attention.
Initially, the original dataset is preprocessed by the data augmentation model, which enriches the data with an array of environmental, growth, and photographic factors, contributing to the robustness and generalization capability of subsequent models. Subsequently, the augmented dataset is transformed into three distinct scales and fed into the classification model for categorization, which integrates multi-scale feature information and enhances the classification outcomes. The three scales in this model are 224 × 224, 160 × 160, and 128 × 128. This multi-scale processing approach enables the model to capture information at various levels, from fine to coarse [25,26]. Extraction at the smaller, lower-resolution scales helps the model focus on global and background information, while the larger, higher-resolution scales preserve the model's attention to detail. Finally, attention maps from the three scales of the classification model are extracted. These are then input into the detection model to accomplish object detection. The adoption of a multi-scale attention mechanism further improves the precision of object detection.
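For illustration, a minimal PyTorch/torchvision sketch of how a single image could be prepared at the three scales is given below; the transform choices and the ImageNet normalization constants are assumptions rather than the authors' exact preprocessing code.

```python
import torch
from torchvision import transforms
from PIL import Image

# The three input scales used by the classification model (Section 2.1).
SCALES = [224, 160, 128]

def make_multiscale_inputs(image: Image.Image) -> list[torch.Tensor]:
    """Resize one image to every scale and return a list of normalized tensors."""
    outputs = []
    for s in SCALES:
        t = transforms.Compose([
            transforms.Resize((s, s)),
            transforms.ToTensor(),
            # ImageNet statistics, since the ViT backbone is ImageNet pre-trained.
            transforms.Normalize(mean=[0.485, 0.456, 0.406],
                                 std=[0.229, 0.224, 0.225]),
        ])
        outputs.append(t(image))
    return outputs
```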

2.2. Data Augmentation Model Based on Multiple Factors

The classification and detection of pest and disease images is frequently affected by a variety of external factors in practical applications. Current datasets fall short of providing comprehensive coverage of these influences, so the diverse interference features cannot be fully acquired and recognized during model learning. Collecting images through field photography is exceedingly challenging, owing to the impacts of varying climatic conditions, the diversity of pest and disease types, and the unpredictability of outbreak timings. Consequently, the data augmentation model based on multiple factors is proposed to address this issue, as depicted in Figure 2.
The external factors affecting classification and detection can be primarily categorized into environmental factors, growth factors, and filming factors through systematic analysis and organization. Environmental factors necessitate that images should incorporate more background elements, such as crops, soil, and water sources, as disturbances, while also possessing effective target occlusion interference. Growth factors demand that images establish the relationship between different stages of disease progression and pest growth phases (e.g., eggs, larvae, adult insects), thereby mitigating the impact of significant intra-class variability and minimal inter-class differences within the pest and disease datasets. Filming factors refer to the clarity, brightness, and shooting angle of the images.
The three factors are addressed by the data augmentation model proposed in this paper. For environmental factors, random regional cropping is applied to the original images. The cropped area is 20% of the original image (it has been experimentally shown that the shape of the cropping region has little influence on the data augmentation, hence rectangular cropping is used in this work), and the cropped segments are subsequently completed using DeepFill [27,28] to generate new image data. New background information is generated by the image completion process, and the 20% random cropping mitigates the possibility of extensive coverage over critical information while also producing an occlusion effect on the target. For growth factors, the model borrows the principle of the mixup algorithm [29]: two or three images randomly selected from within a class of the dataset are fused (when two images are fused, each image's fusion weight is 0.5; when three images are fused, the weight for each is 1/3), and the fused labels remain the labels of that class. This methodology enhances the degree of information correlation between intra-class images, facilitating the establishment of connections between stages of disease progression and growth phases. For filming factors, the model randomly selects two or three combinations of basic transformations (e.g., angle transformations with a random range of 1–360°; brightness, saturation, and contrast transformations with a random range of 0.5–1.5; blurring transformations with a random range of 1–2; scaling transformations with a random range of 0.5–2) and applies them to the original image to generate new image data. The combination of image transformations facilitates the construction of new data containing various variants, which improves the robustness of the model.
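The following sketch illustrates the three augmentation operations in PyTorch. It is illustrative only: the helper names are hypothetical, the torchvision transforms are stand-ins for the basic transformations listed above, and the environmental-factor function only produces the masked image and mask that an inpainting model such as DeepFill would then complete.

```python
import random
import torch
import torchvision.transforms as T

def growth_factor_fuse(images: list[torch.Tensor]) -> torch.Tensor:
    """Fuse 2 or 3 same-class images with equal weights (0.5 each or 1/3 each);
    the fused sample keeps the original class label."""
    k = random.choice([2, 3])
    picked = random.sample(images, k)
    return sum(img / k for img in picked)

def filming_factor_transform() -> T.Compose:
    """Randomly combine 2 or 3 basic transformations, using the parameter
    ranges given in the text."""
    candidates = [
        T.RandomRotation(degrees=(1, 360)),
        T.ColorJitter(brightness=(0.5, 1.5), saturation=(0.5, 1.5), contrast=(0.5, 1.5)),
        T.GaussianBlur(kernel_size=5, sigma=(1.0, 2.0)),
        T.RandomAffine(degrees=0, scale=(0.5, 2.0)),
    ]
    return T.Compose(random.sample(candidates, random.choice([2, 3])))

def environmental_factor_mask(img: torch.Tensor) -> tuple[torch.Tensor, torch.Tensor]:
    """Cut out a random rectangle covering ~20% of the image area and return
    the masked image plus the binary mask; the hole would then be re-filled
    with an inpainting model such as DeepFill (not shown here)."""
    _, h, w = img.shape
    area = 0.20 * h * w
    ch = max(1, min(int((area * random.uniform(0.5, 2.0)) ** 0.5), h))  # random aspect
    cw = max(1, min(int(area / ch), w))
    top = random.randint(0, h - ch)
    left = random.randint(0, w - cw)
    mask = torch.zeros(1, h, w)
    mask[:, top:top + ch, left:left + cw] = 1.0
    return img * (1.0 - mask), mask
```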

2.3. Classification Model Based on Multi-Scale ViT

The overview of the classification model based on multi-scale ViT is shown in Figure 3. The inputs to the model are images at 224 × 224, 160 × 160, and 128 × 128. Images at these scales are individually subjected to patching and position embedding, and are then fed into the ViT model. Then, the classification token ([CLS]) at each of the three scales is obtained and the corresponding classification result is produced by the classifier, which can be represented by the following equation.
$c = \mathrm{softmax}(W_f z + b_f)$,
where $z$ is the [CLS] data at the respective scale, $W_f$ and $b_f$ are the weights and biases of the fully connected layer, and $c$ is the classification result vector.
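A minimal sketch of this per-scale classification head is shown below; the embedding dimension and class count are illustrative assumptions (768 for a ViT-Base backbone, 102 classes for IP102).

```python
import torch
import torch.nn as nn

class ScaleClassifier(nn.Module):
    """Per-scale head: a fully connected layer over the [CLS] token,
    followed by softmax, as in c = softmax(W_f z + b_f)."""
    def __init__(self, embed_dim: int = 768, num_classes: int = 102):
        super().__init__()
        self.fc = nn.Linear(embed_dim, num_classes)  # W_f, b_f

    def forward(self, cls_token: torch.Tensor) -> torch.Tensor:
        # cls_token: (batch, embed_dim), the [CLS] embedding at one scale
        return torch.softmax(self.fc(cls_token), dim=-1)
```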
Meanwhile, in the model, the [CLS] tokens at the three scales are concatenated to generate the overall [CLS], which is likewise fed into the classifier to obtain the final classification result. The knowledge distillation technique [30] is used in this model to improve the final classification results. In this framework, the ViT models at each scale are used as independent teacher models and output soft targets; their predicted category probability distributions are passed as training signals to the integrated student model. Furthermore, the model combines two types of losses to establish the link between the teacher models and the student model. The hard target loss uses a cross-entropy loss that directly targets the real category labels, ensuring that the student model performs well on the basic classification task. Smooth L1 loss is used for the soft target loss, which reduces sensitivity to outliers and provides more stable gradients during training. The final loss function is shown in the following equations.
$L = \alpha L_{\mathrm{hard}} + (1 - \alpha) L_{\mathrm{soft}}$,
$L_{\mathrm{hard}} = -\sum_{c=1}^{M} y_{o,c} \log p_{o,c}$,
$L_{\mathrm{soft}} = \frac{1}{N} \sum_{i} \begin{cases} 0.5 (x_i - x_{ti})^2, & \text{if } |x_i - x_{ti}| < 1 \\ |x_i - x_{ti}| - 0.5, & \text{otherwise}, \end{cases}$
where $\alpha$ is a hyperparameter used to balance the importance of the two losses, $M$ is the total number of categories, $y_{o,c}$ is a binary indicator (0 or 1) that is 1 if category $c$ is the correct categorization of observation $o$, $p_{o,c}$ is the probability that the model predicts observation $o$ to be of category $c$, $x_i$ denotes the prediction of the student model, $x_{ti}$ denotes the prediction of the teacher model, and $N$ is the batch size.
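The combined loss can be sketched as follows, assuming it is computed on the prediction vectors described above; the function and argument names are hypothetical, and α = 0.6 follows the setting in Section 3.2.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits: torch.Tensor,
                      teacher_logits: torch.Tensor,
                      targets: torch.Tensor,
                      alpha: float = 0.6) -> torch.Tensor:
    """Combined loss L = alpha * L_hard + (1 - alpha) * L_soft.
    L_hard is cross-entropy against the true labels; L_soft is a Smooth L1
    loss between the student and teacher predictions (the soft targets)."""
    l_hard = F.cross_entropy(student_logits, targets)
    l_soft = F.smooth_l1_loss(student_logits, teacher_logits, beta=1.0)
    return alpha * l_hard + (1.0 - alpha) * l_soft
```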

2.4. Detection Model Based on Multi-Scale Attention

Figure 4 shows the specific structure of the detection model based on multi-scale attention, which is implemented on the basis of the classification model proposed in Section 2.3. In the multi-scale ViT models of the classification framework, the multi-head self-attention mechanism permits the model to concurrently consider different subsets of information while processing images. This property makes it possible to extract the attention weights at each scale and generate the corresponding attention maps. The attention values of the last layer of the multi-head attention mechanism in the ViT model are extracted and averaged over the heads, and finally the attention map corresponding to each scale is obtained.
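A sketch of this extraction step is shown below, assuming the attention weights of the final block can be read out as a (batch, heads, tokens, tokens) tensor; how those weights are hooked depends on the particular ViT implementation.

```python
import torch

def attention_map_from_last_block(attn_weights: torch.Tensor,
                                  patch_grid: int) -> torch.Tensor:
    """Average the multi-head attention of the last ViT block and reshape the
    [CLS]-to-patch row into a 2-D attention map.

    attn_weights: (batch, heads, tokens, tokens) from the final block,
                  where token 0 is [CLS] and the rest are patches.
    patch_grid:   number of patches per side (e.g. 14 for 224/16).
    """
    attn = attn_weights.mean(dim=1)          # average over heads -> (B, T, T)
    cls_to_patches = attn[:, 0, 1:]          # attention of [CLS] to each patch
    return cls_to_patches.reshape(-1, 1, patch_grid, patch_grid)
```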
Subsequently, a small network structure is used in the model to fuse the attention maps at the three scales and generate a map that integrates information from all scales. The network structure can be divided into three steps: resizing, map stacking, and convolution. In the first step, the attention maps at the three scales are upsampled by bilinear interpolation to match the largest size. Then, all the attention maps (A1, A2, A3), resized to the same size, are stacked along the channel dimension. Finally, a 1 × 1 convolution is applied to integrate the features and reduce the number of channels to one. These operations integrate multilevel features and improve the perceptual power and flexibility of the model, which can be represented by the following equations, where $l$ and $h$ are the length and height of the largest scale.
$A_i = U(A_i, l, h)$,
$A_{\mathrm{fusion}} = \mathrm{Conv}(\mathrm{concat}(A_1, A_2, A_3))$.
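A minimal sketch of this fusion network is given below; apart from the bilinear upsampling, channel stacking, and 1 × 1 convolution described above, all names and defaults are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentionFusion(nn.Module):
    """Fuse the three single-scale attention maps: bilinear upsampling to the
    largest size, stacking along channels, then a 1x1 convolution down to one
    channel, as in A_fusion = Conv(concat(A1, A2, A3))."""
    def __init__(self):
        super().__init__()
        self.conv1x1 = nn.Conv2d(in_channels=3, out_channels=1, kernel_size=1)

    def forward(self, a1, a2, a3):
        # a1 is assumed to come from the largest scale; match the others to it
        size = a1.shape[-2:]
        a2 = F.interpolate(a2, size=size, mode="bilinear", align_corners=False)
        a3 = F.interpolate(a3, size=size, mode="bilinear", align_corners=False)
        stacked = torch.cat([a1, a2, a3], dim=1)   # (B, 3, H, W)
        return self.conv1x1(stacked)               # (B, 1, H, W)
```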
Finally, another small network is designed to perform target localization and generate the coordinates of the bounding box. The network consists of a convolutional layer whose output is passed through a ReLU activation function and then processed by a max pooling layer. Subsequently, the feature maps are flattened and a fully connected layer is used to predict the four coordinate values of the bounding box. Overall, the network achieves efficient extraction and localization of targets from the complex attention maps through a sequential processing flow.
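A sketch of such a localization head is shown below; the channel count, kernel size, and attention-map size are illustrative assumptions, since the text does not specify them.

```python
import torch
import torch.nn as nn

class BoxHead(nn.Module):
    """Localization head: convolution -> ReLU -> max pooling -> flatten ->
    fully connected layer predicting the four bounding-box coordinates."""
    def __init__(self, map_size: int = 14, channels: int = 16):
        super().__init__()
        self.conv = nn.Conv2d(1, channels, kernel_size=3, padding=1)
        self.pool = nn.MaxPool2d(2)
        self.fc = nn.Linear(channels * (map_size // 2) ** 2, 4)

    def forward(self, fused_attention: torch.Tensor) -> torch.Tensor:
        x = torch.relu(self.conv(fused_attention))
        x = self.pool(x)
        x = torch.flatten(x, start_dim=1)
        return self.fc(x)  # (batch, 4): bounding-box coordinates
```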

3. Experimental Setup

3.1. Datasets

In order to validate the method in this paper, two representative datasets (IP102 [31] and Plant Village [32], as shown in Table 1 and Figure 5) in the field of pest and disease classification and detection in agriculture were selected. IP102 is a large-scale benchmark dataset for pest identification, which contains 75,222 images of eight crops, including rice, corn, wheat, sugar beet, alfalfa, grape, citrus, and mango, with a total of 102 pest categories. The dataset exhibits a diverse array of image forms, encompassing samples from various developmental stages of pest infestation, and demonstrates characteristics of imbalance and high variance, presenting a more challenging scenario. Plant Village is a dataset of plant leaf images, consisting of 54,306 images of healthy and unhealthy leaves, grouped into 38 categories by species and type of disease (covering 14 crops with 17 fungal diseases, four bacterial diseases, two mold (oomycete) diseases, two viral diseases, and one disease caused by a mite, plus 12 healthy-leaf categories).
The IP102 dataset provides split rules that can be applied to the classification and object detection tasks. In the classification task, the training set contains 45,095 images, the validation set 7508, and the test set 22,619. In the object detection task, the number of images containing object annotations is 18,976, of which 15,178 are used for training and 3798 for testing. Plant Village is only suitable for classification tasks, and a ratio of 8:1:1 is used for automatic partitioning in this paper: the training set contains 43,429 images, the validation set 5417, and the test set 5459.
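An 8:1:1 partition of this kind could be produced as follows; the dataset path and the fixed seed are hypothetical, not the authors' actual setup.

```python
import torch
from torch.utils.data import random_split
from torchvision.datasets import ImageFolder

# Plant Village has no official split, so an 8:1:1 random partition is used.
dataset = ImageFolder("plant_village/")          # hypothetical local path
n = len(dataset)
n_train, n_val = int(0.8 * n), int(0.1 * n)
n_test = n - n_train - n_val
train_set, val_set, test_set = random_split(
    dataset, [n_train, n_val, n_test],
    generator=torch.Generator().manual_seed(42))  # fixed seed for reproducibility
```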

3.2. Experiment Detail

The model was constructed using the PyTorch deep learning framework, and the ViT component employs pre-trained parameters from ImageNet. The dataset images were resized to 224 × 224 pixels, and data augmentation was performed with the model described in Section 2.2. Three experimental tasks were conducted in this study. In SFA-ViT, data augmentation is used in part to promote the final classification and detection results; in order to analyze the effect of each factor's augmentation, separate data augmentation experiments were designed.
The data augmentation experiments were implemented on the IP102 and Plant Village datasets. The data augmentation model from Section 2.2 was applied to the training set, and VGG, ResNet50, and ViT were chosen as base models to validate the effect of multiple factors. Accuracy (ACC) was used as the metric in this part.
The classification experiments were performed on the IP102 and Plant Village datasets. The original and augmented datasets were scaled to 224, 160, and 128 and fed into the classification model from Section 2.3 for testing. Accuracy (ACC), F1 score (F1), recall (R), precision (P), and the Matthews correlation coefficient (MCC) were used as metrics in this part of the experiment to comprehensively evaluate the classification performance of SFA-ViT.
The object detection experiments were conducted on the IP102 dataset only. The multi-scale attention maps were extracted from the classification model trained on the augmented dataset and fed into the detection model from Section 2.4 to obtain the localization information. The mean average precision (mAP) was used as the core evaluation metric in this study, with IoU as the threshold parameter for mAP. Three mAP-derived metrics (mAP0.5–0.95, mAP0.5, mAP0.75) were of particular interest.
In addition, the Adam optimization algorithm was used to update the model parameters, with the learning rate set to 0.0001. The α in knowledge distillation was set to 0.6. Each batch contained 64 images, training ran for 80 epochs, and the experiments were performed on an RTX 3090 Ti GPU.
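For reference, this training configuration can be expressed as follows; the Linear layer is only a stand-in so that the snippet is self-contained, not the actual SFA-ViT parameters.

```python
import torch
import torch.nn as nn

# Optimizer setup matching Section 3.2 (Adam, lr = 1e-4); the Linear layer is
# a placeholder for the SFA-ViT parameters so the snippet runs on its own.
model = nn.Linear(768, 102)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
EPOCHS = 80       # training epochs
BATCH_SIZE = 64   # images per batch
ALPHA = 0.6       # knowledge-distillation weight in the combined loss
```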

4. Results and Discussion

4.1. Data Augmentation Experiment

The influence of each individual factor and of the complete data augmentation model in SFA-ViT on the results is presented in Table 2. The original models in the experiments were trained without data augmentation. For each single-factor model, the training set was expanded nine-fold using the corresponding factor's augmentation method. The complete model was likewise augmented nine-fold, with the three factors applied in a 1:1:1 ratio.
The experimental results indicate that the ViT model achieves the best performance results relative to the other base models, which may be attributed to the self-attention mechanism of ViT, inherently more adept at capturing global features, enabling it to make superior classification decisions on the original data.
In comparison to the scenario without augmentation, the models augmented with environmental, growth, and filming factors have seen an increase in accuracy across both datasets. Taking the experimental results of the ViT model on the IP102 dataset as an example, the classification accuracy improved by 1.28% after environmental factor augmentation, by 0.94% after growth factor augmentation, and by 0.9% after filming factor augmentation. This indicates that the three factors in the model have, to different degrees, introduced a wider range of conditions, which has reduced the risk of the model becoming overly specialized to particular circumstances. Within the IP102 dataset, the augmentation of environmental factors produced the most favorable outcomes, which is likely because the dataset includes a substantial number of field images where environmental factor augmentation can improve the model's judgment of background disturbances. On the other hand, in the Plant Village dataset, the augmentation of growth factors surpassed the others in effectiveness. This could be due to the dataset's images all being laboratory-shot with a consistent background, thus making the enhancement of the relationship between disease severity levels more effective.
Furthermore, the most notable enhancement in model performance was achieved through the application of complete augmentation, suggesting that the combined impact of multiple factors aids the model in acquiring a wider array of feature information, thus improving its generalization and robustness. On the IP102 dataset, the classification accuracy of the VGG model improved by 1.67%, Resnet50 by 1.54%, and ViT by 1.71%. The outcomes on the Plant Village dataset also experienced improvements to varying extents. These results indicate that the data augmentation model demonstrates favorable practical efficacy, effectively addressing the issues of data imbalance and sample scarcity in agricultural pest and disease classification and detection, thereby reducing the economic costs of research.

4.2. Classification Experiment

The evaluation metrics of SFA-ViT in the classification task are given in Table 3. It is worth mentioning that on the IP102 dataset, the SFA-ViT achieved a 6.52 percentage point increase in accuracy and an 8.64 percentage point improvement in the F1 score compared to the ViT model in the absence of data augmentation. The performance improvement is also significant on the Plant Village dataset. This suggests that the integration of a multi-scale parallel network architecture with knowledge distillation substantially improves the performance of the ViT model on classification tasks, which is an indication that the classification model is effective. The improvement is likely due to the enhanced ability of the model to extract more effective target feature information. On the other hand, the combined application of the classification and data augmentation models in SFA-ViT further elevates the classification outcomes, indirectly demonstrating the efficacy of the data augmentation model.
Figure 6 illustrates the variation in classification accuracy of the model on the two datasets as a function of the number of training epochs. The classification accuracy at each scale corresponds to the results obtained from the three individual scale [CLS] tokens as described in Section 2.3. The results across the three scales reveal distinct patterns. Specifically, as the image resolution increases, so does the classification accuracy. This correlation is likely attributed to the richer feature information that high-resolution images offer, which can be more effectively captured and utilized by the model. However, the SFA-ViT achieves the highest final classification accuracy, surpassing the results of each individual scale. This achievement is credited to the technique of knowledge transfer, which encourages the model to fully learn the features from images of various resolutions. It also suggests that low-resolution images still contain characteristics not present in high-resolution images, further emphasizing the importance of multi-scale synthesis. Moreover, the data depicted in the figure also substantiate the crucial role of the data augmentation sub-model within the architecture, which enhances the model's generalization capabilities, allowing it to achieve superior performance in earlier epochs.
Table 4 presents a comparative analysis of the performance of the proposed model with other models on the IP102 and Plant Village datasets. Notably, the SFA-ViT model achieved an accuracy of 74.70% and an F1 score of 68.27% on the IP102 dataset. Similar trends were observed on the Plant Village dataset. These results further validate the superior performance of the SFA-ViT in the pest and disease image classification task, which enables a broader range of agricultural producers to conduct early detection and prevention of pests and diseases, thereby propelling the development of agricultural intelligence and precision.

4.3. Object Detection Experiment

The evaluation metrics of SFA-ViT in the object detection task are given in Table 5. It should be noted that D-ViT refers to a model that directly employs the target localization network described in Section 2.4 on the basis of the ViT model to achieve object localization. The attention map of the SFA-ViT model is extracted based on the attention mechanism in the classification model, and the classification network is trained on the IP102 classification dataset. The integration of data augmentation enhances the classification network, thereby improving the attention maps extracted by the detection model.
The experimental results indicate that the combination of the ViT model and the target localization network already possesses a certain level of object detection capability, albeit with limited performance. The SFA-ViT without augmentation has seen significant improvements in all three evaluation metrics, which demonstrates that extracting feature information from a multi-scale perspective has a significant impact on the model’s attention, enabling the model to focus on more appropriate regions. After introducing the data augmentation model, the SFA-ViT achieved the optimal effect, suggesting that the combination of the three models has further promoted each other, significantly enhancing the performance of object detection, especially the improvement in precise target localization.
To further validate the effectiveness of SFA-ViT, the test images and results as well as the attention map in the model were visualized in this study, as shown in Figure 7. It can be observed from the figure that the attention maps at three different scales exhibit varying focus areas. The attention map shows attention to more details as well as key features when the scale is large; the attention position of the model is more accurate for better localization at small scales. Moreover, the fused attention map integrates the advantages at various scales, possessing a more accurate location and also covering detailed situations. Ultimately, this model can effectively capture the target locations within the detection images.
Table 6 compares the performance of the proposed model with various object detection models on agricultural pest detection tasks. The comparison results indicate that the SFA-ViT model outperforms all other models across all evaluation metrics, achieving mAP0.5–0.95 of 34.7%, reaching 56.8% in mAP0.5, and leading with 37.6% in mAP0.75. This highlights the significant advantages of the detection model’s multi-scale feature extraction and fusion mechanism, allowing for precise localization of pests and diseases within complex agricultural settings. Such precision empowers agricultural practitioners to implement more targeted and effective control strategies.

5. Conclusions

The SFA-ViT model proposed in this paper introduces three improvements to the ViT model, each implemented by a corresponding sub-model. The data augmentation model expands the dataset for agricultural pest and disease classification and detection by incorporating the effects of environmental, growth, and filming factors, thereby reducing the model’s reliance on real data. The classification model designs a multi-scale parallel structure based on the ViT model, and adopts the knowledge distillation technique to promote the fusion and optimization of the feature information of each scale, which obtains a better classification effect of pests and diseases. The detection model integrates the attention maps from the ViT model across multiple scales, further augmenting the model’s perceptual capacity and achieving effective target localization.
In addition, the model proposed in this paper achieves excellent results on the IP102 dataset (74.7% classification accuracy, 56.8% mAP0.5) and the Plant Village dataset (99.87% classification accuracy), significantly outperforming the existing base networks.
The SFA-ViT introduces a multi-scale feature fusion mechanism on the basis of the ViT model, enhancing the model’s adaptability to complex scenes. In the field of agricultural pest detection, SFA-ViT not only accurately identifies the types of pests and diseases but also precisely locates the positions of pests and lesions, thereby promoting the development of precision agriculture. Additionally, the multi-scale processing capabilities of SFA-ViT hold cross-disciplinary application value, such as in industrial monitoring, environmental monitoring, and medical image analysis.
However, SFA-ViT exhibits weaknesses in terms of memory usage and computational efficiency. In future work, deep optimization of the model will be conducted, with a further focus on the research of classification and detection of pest and disease in agriculture.

Author Contributions

Conceptualization, M.X. and N.Y.; methodology, M.X.; software, M.X.; validation, M.X. and N.Y.; formal analysis, M.X.; investigation, M.X.; resources, M.X.; data curation, M.X.; writing—original draft preparation, M.X.; writing—review and editing, M.X. and N.Y.; visualization, M.X.; supervision, M.X.; project administration, M.X.; funding acquisition, N.Y. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Key Research and Development Plan of China, grant number 2016YFD0600101.

Data Availability Statement

Some or all data, models, or codes that support the findings of this study are available from the corresponding author upon reasonable request. The data are not publicly available due to privacy.

Conflicts of Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

References

  1. Strange, R.N.; Scott, P.R. Plant Disease: A Threat to Global Food Security. Annu. Rev. Phytopathol. 2005, 43, 83–116. [Google Scholar] [CrossRef]
  2. FAO. Tracking Progress on Food and Agriculture-Related SDG Indicators 2023; Food and Agriculture Organization of the United Nations: Rome, Italy, 2023. [Google Scholar]
  3. Tee, C.A.T.; Teoh, Y.X.; Yee, L.; Tan, B.C.; Lai, K.W. Discovering the Ganoderma boninense detection methods using machine learning: A review of manual, laboratory, and remote approaches. IEEE Access 2021, 9, 105776–105787. [Google Scholar] [CrossRef]
  4. Muppala, C.; Guruviah, V. Machine vision detection of pests, diseases and weeds: A review. J. Phytol. 2020, 12, 9–19. [Google Scholar] [CrossRef]
  5. Qing, Y.; Xian, D.X.; Liu, Q.J.; Yang, B.J.; Diao, G.Q.; Jian, T. Automated counting of rice planthoppers in paddy fields based on image processing. J. Integr. Agric. 2014, 13, 1736–1745. [Google Scholar]
  6. Rajan, P.; Radhakrishnan, B.; Suresh, L.P. Detection and classification of pests from crop images using support vector machine. In Proceedings of the 2016 International Conference on Emerging Technological Trends (ICETT), Kollam, India, 21–22 October 2016; pp. 1–6. [Google Scholar]
  7. Schor, N.; Bechar, A.; Ignat, T.; Dombrovsky, A.; Elad, Y.; Berman, S. Robotic disease detection in greenhouses: Combined detection of powdery mildew and tomato spotted wilt virus. IEEE Robot. Autom. Lett. 2016, 1, 354–360. [Google Scholar] [CrossRef]
  8. Liu, J.; Wang, X. Plant diseases and pests detection based on deep learning: A review. Plant Methods 2021, 17, 1–18. [Google Scholar] [CrossRef]
  9. Shoaib, M.; Shah, B.; Ei-Sappagh, S.; Ali, A.; Ullah, A.; Alenezi, F.; Gechev, T.; Hussain, T.; Ali, F. An advanced deep learning models-based plant disease detection: A review of recent research. Front. Plant Sci. 2023, 14, 1158933. [Google Scholar]
  10. Hinton, G.E.; Salakhutdinov, R.R. Reducing the dimensionality of data with neural networks. Science 2006, 313, 504–507. [Google Scholar] [CrossRef]
  11. Nettleton, D.F.; Katsantonis, D.; Kalaitzidis, A.; Sarafijanovic-Djukic, N.; Puigdollers, P.; Confalonieri, R. Predicting rice blast disease: Machine learning versus process-based models. BMC Bioinform. 2019, 20, 1–16. [Google Scholar] [CrossRef] [PubMed]
  12. Duarte-Carvajalino, J.M.; Alzate, D.F.; Ramirez, A.A.; Santa-Sepulveda, J.D.; Fajardo-Rojas, A.E.; Soto-Suárez, M. Evaluating late blight severity in potato crops using unmanned aerial vehicles and machine learning algorithms. Remote Sens. 2018, 10, 1513. [Google Scholar] [CrossRef]
  13. Nagasubramanian, K.; Jones, S.; Singh, A.K.; Sarkar, S.; Singh, A.; Ganapathysubramanian, B. Plant disease identification using explainable 3D deep learning on hyperspectral images. Plant Methods 2019, 15, 1–10. [Google Scholar] [CrossRef]
  14. Picon, A.; Seitz, M.; Alvarez-Gila, A.; Mohnke, P.; Ortiz-Barredo, A.; Echazarra, J. Crop conditional Convolutional Neural Networks for massive multi-crop plant disease classification over cell phone acquired images taken on real field conditions. Comput. Electron. Agric. 2019, 167, 105093. [Google Scholar] [CrossRef]
  15. Fang, T.; Chen, P.; Zhang, J.; Wang, B. Crop leaf disease grade identification based on an improved convolutional neural network. J. Electron. Imaging 2020, 29, 013004. [Google Scholar] [CrossRef]
  16. Thenmozhi, K.; Reddy, U.S. Crop pest classification based on deep convolutional neural network and transfer learning. Comput. Electron. Agric. 2019, 164, 104906. [Google Scholar] [CrossRef]
  17. Fuentes, A.; Yoon, S.; Kim, S.C.; Park, D.S. A robust deep-learning-based detector for real-time tomato plant diseases and pests recognition. Sensors 2017, 17, 2022. [Google Scholar] [CrossRef]
  18. Ozguven, M.M.; Adem, K. Automatic detection and classification of leaf spot disease in sugar beet using deep learning algorithms. Phys. A Stat. Mech. Its Appl. 2019, 535, 122537. [Google Scholar] [CrossRef]
  19. Sun, J.; Yang, Y.; He, X.; Wu, X. Northern maize leaf blight detection under complex field environment based on deep learning. IEEE Access 2020, 8, 33679–33688. [Google Scholar] [CrossRef]
  20. Hu, W.; Hong, W.; Wang, H.; Liu, M.; Liu, S. A Study on Tomato Disease and Pest Detection Method. Appl. Sci. 2023, 13, 10063. [Google Scholar] [CrossRef]
  21. Peng, Y.; Wang, Y. Leaf disease image retrieval with object detection and deep metric learning. Front. Plant Sci. 2022, 13, 963302. [Google Scholar] [CrossRef]
  22. Hechen, Z.; Huang, W.; Zhao, Y. ViT-LSLA: Vision transformer with light self-limited-attention. arXiv 2022, arXiv:2210.17115. [Google Scholar]
  23. Chang, B.; Wang, Y.; Zhao, X.; Li, G.; Yuan, P. A general-purpose edge-feature guidance module to enhance vision transformers for plant disease identification. Expert Syst. Appl. 2024, 237, 121638. [Google Scholar] [CrossRef]
  24. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.; Polosukhin, I. Attention is all you need. Adv. Neural Inf. Process. Syst. 2017, 30, 1–11. [Google Scholar]
  25. Chen, Z.; Zhou, H.; Lin, H.; Bai, D. TeaViTNet: Tea Disease and Pest Detection Model Based on Fused Multiscale Attention. Agronomy 2024, 14, 633. [Google Scholar] [CrossRef]
  26. Yang, T.; Wang, Y.; Lian, J. Plant Diseased Lesion Image Segmentation and Recognition Based on Improved Multi-Scale Attention Net. Appl. Sci. 2024, 14, 1716. [Google Scholar] [CrossRef]
  27. Yu, J.; Lin, Z.; Yang, J.; Shen, X.; Lu, X.; Huang, T.S. Generative image inpainting with contextual attention. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 5505–5514. [Google Scholar]
  28. Yu, J.; Lin, Z.; Yang, J.; Shen, X.; Lu, X.; Huang, T.S. Free-form image inpainting with gated convolution. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27–28 October 2019; pp. 4471–4480. [Google Scholar]
  29. Zhang, H.; Cisse, M.; Dauphin, Y.N.; Lopez-Paz, D. mixup: Beyond empirical risk minimization. arXiv 2017, arXiv:1710.09412. [Google Scholar]
  30. Hinton, G.; Vinyals, O.; Dean, J. Distilling the knowledge in a neural network. arXiv 2015, arXiv:1503.02531. [Google Scholar]
  31. Wu, X.; Zhan, C.; Lai, Y.K.; Cheng, M.M.; Yang, J. Ip102: A large-scale benchmark dataset for insect pest recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 16–20 June 2019; pp. 8787–8796. [Google Scholar]
  32. Hughes, D.; Salathé, M. An open access repository of images on plant health to enable the development of mobile disease diagnostics. arXiv 2015, arXiv:1511.08060. [Google Scholar]
  33. Zhou, S.Y.; Su, C.Y. Efficient convolutional neural network for pest recognition-ExquisiteNet. In Proceedings of the 2020 IEEE Eurasia Conference on IOT, Communication and Engineering (ECICE), Yunlin, Taiwan, 23–25 October 2020; pp. 216–219. [Google Scholar]
  34. Liu, W.; Wu, G.; Ren, F. Deep multibranch fusion residual network for insect pest recognition. IEEE Trans. Cogn. Dev. Syst. 2020, 13, 705–716. [Google Scholar] [CrossRef]
  35. Nanni, L.; Maguolo, G.; Pancino, F. Insect pest image detection and recognition based on bio-inspired methods. Ecol. Inform. 2020, 57, 101089. [Google Scholar] [CrossRef]
  36. Ayan, E.; Erbay, H.; Varçın, F. Crop pest classification with a genetic algorithm-based weighted ensemble of deep convolutional neural networks. Comput. Electron. Agric. 2020, 179, 105809. [Google Scholar] [CrossRef]
  37. Luo, Q.; Wan, L.; Tian, L.; Li, Z. Saliency guided discriminative learning for insect pest recognition. In Proceedings of the 2021 International Joint Conference on Neural Networks (IJCNN), Virtual Conference, 18–22 July 2021; pp. 1–8. [Google Scholar]
  38. Enes, A. Genetic Algorithm-Based Hyperparameter Optimization for Convolutional Neural Networks in the Classification of Crop Pests. Arab. J. Sci. Eng. 2024, 49, 3079–3093. [Google Scholar]
  39. Srabani, B.; Ipsita, S.; Abanti, D. Plant disease identification using a novel time-effective CNN architecture. Multimed. Tools Appl. 2024, 1, 1–23. [Google Scholar]
  40. Batchuluun, G.; Nam, S.H.; Park, K.R. Deep learning-based plant-image classification using a small training dataset. Mathematics 2022, 10, 3091. [Google Scholar] [CrossRef]
  41. Sowmiya, B.; Saminathan, K.; Chithra, D.M. An Ensemble of Transfer Learning based InceptionV3 and VGG16 Models for Paddy Leaf Disease Classification. ECTI Trans. Comput. Inf. Technol. 2024, 18, 89–100. [Google Scholar]
  42. Mohanty, S.P.; Hughes, D.P.; Salathé, M. Using deep learning for image-based plant disease detection. Front. Plant Sci. 2016, 7, 215232. [Google Scholar] [CrossRef]
  43. Wagle, S.A.; Harikrishnan, R.; Ali, S.H.M.; Faseehuddin, M. Classification of plant leaves using new compact convolutional neural network models. Plants 2021, 11, 24. [Google Scholar] [CrossRef]
  44. Vo, H.T.; Quach, L.D.; Hoang, T.N. Ensemble of deep learning models for multi-plant disease classification in smart farming. Int. J. Adv. Comput. Sci. Appl. 2023, 14, 1054. [Google Scholar] [CrossRef]
  45. Sun, P.; Zhang, R.; Jiang, Y.; Kong, T.; Xu, C.; Zhan, W.; Tomizuka, M.; Li, L.; Yuan, Z.; Wang, C.; et al. Sparse R-CNN: End-to-End Object Detection with Learnable Proposals. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021. [Google Scholar]
  46. Lin, T.Y.; Dollár, P.; Girshick, R.; He, K.; Hariharan, B.; Belongie, S. Feature pyramid networks for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 2117–2125. [Google Scholar]
  47. Feng, C.; Zhong, Y.; Gao, Y.; Scott, M.R.; Huang, W. TOOD: Task-Aligned One-Stage Object Detection. In Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, BC, Canada, 11–17 October 2021. [Google Scholar]
  48. Kim, K.; Lee, H.S. Probabilistic anchor assignment with iou prediction for object detection. In Proceedings of the Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, 23–28 August 2020; pp. 355–371. [Google Scholar]
  49. Ge, Z.; Liu, S.; Wang, F.; Li, Z.; Sun, J. Yolox: Exceeding yolo series in 2021. arXiv 2021, arXiv:2107.08430. [Google Scholar]
  50. Zhang, H.; Chang, H.; Ma, B.; Wang, N.; Chen, X. Dynamic R-CNN: Towards high quality object detection via dynamic training. In Proceedings of the Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, 23–28 August 2020; pp. 260–275. [Google Scholar]
Figure 1. Overview of SFA-ViT, which includes three key sub-models: a data augmentation model based on multiple factors, a classification model based on multi-scale ViT, and a detection model based on multi-scale attention.
Figure 2. The multiple factors (environmental factor, growth factor, and filming factor) of the data augmentation model.
Figure 3. The structure of the classification model based on multi-scale ViT. Below is the multi-scale parallel network structure, and above is the classification module based on knowledge distillation. *—patch and position embedding, F—16 in hexadecimal.
Figure 4. Overview of detection model based on multi-scale attention. Bottom left corner: Fusion of multi-head attention maps. Green part: Attention map fusion network. Yellow part: Target localization network.
Figure 5. Image samples of (a) IP102 dataset in classification task, (b) IP102 dataset in object detection task, (c) Plant Village dataset in classification task.
Figure 6. The variation in accuracy of SFA-ViT with training epoch in the classification task under (a) original IP102, (b) augmented IP102, (c) original Plant Village, (d) augmented Plant Village. ('val_acc' is the accuracy of SFA-ViT; 'val_224-acc', 'val_160-acc', and 'val_128-acc' represent the accuracy of the results at the three scales shown in Figure 3.)
Figure 7. Attention maps and object detection results of SFA-ViT.
Table 1. Information of dataset IP102 and Plant Village.
Dataset         Classes   Task 1   Samples   Train    Val    Test
IP102           102       CR       75,222    45,095   7508   22,619
                          OD       18,976    15,178   N/A    3798
Plant Village   38        CR       54,306    43,429   5417   5459
1 CR means classification, OD means object detection.
Table 2. The accuracy (%) of different data augmentation models.
Model 1   IP102                          Plant Village
          VGG      Resnet50   ViT       VGG      Resnet50   ViT
O         63.98    65.75      67.25     98.99    99.38      99.63
E         65.01    66.87      68.53     99.25    99.46      99.68
G         64.87    66.57      68.19     99.42    99.52      99.76
F         64.69    66.62      68.15     99.26    99.50      99.70
C         65.65    67.29      68.96     99.60    99.63      99.82
1 O is the original model, E is the environmental factor model, G is the growth factor model, F is the filming factor model, and C is the complete model.
Table 3. Effectiveness analysis of classification model and data augmentation model in SFA-ViT.
Dataset         Model 1        ACC (%)   F1 (%)   R (%)    P (%)    MCC
IP102           ViT (O)        67.25     58.87    58.39    63.05    0.6638
                SFA-ViT (O)    73.77     67.51    69.19    66.49    0.7303
                SFA-ViT (A)    74.70     68.27    69.86    67.30    0.7399
Plant Village   ViT (O)        99.63     99.38    99.36    99.42    0.9962
                SFA-ViT (O)    99.84     99.70    99.72    99.68    0.9983
                SFA-ViT (A)    99.87     99.79    99.78    99.81    0.9986
1 O means model is based on original dataset, A means model is based on augmented dataset.
Table 4. Comparison results of different models in classification task.
Dataset         Model                              Accuracy (%)   F1 (%)
IP102           ExquisiteNet [33]                  52.32          N/A
                DMF-ResNet [34]                    59.22          58.37
                SaliencyEnsemble [35]              61.93          N/A
                GAEnsemble [36]                    67.13          65.76
                SGDL-Net [37]                      71.16          63.89
                IRNV2 [38]                         71.84          64.06
                SFA-ViT (ours)                     74.70          68.27
Plant Village   Srabani B. et al. [39]             95.17          95.11
                PI-CNN [40]                        97.26          92.38
                TL-Ensemble [41]                   98.87          N/A
                DL-GoogleNet [42]                  99.35          99.34
                Improved-AlexNet [43]              99.73          N/A
                EfficientNetB0-MobileNetV2 [44]    99.77          N/A
                SFA-ViT (ours)                     99.87          99.79
Table 5. Effectiveness analysis of object detection model and data augmentation model in SFA-ViT.
Model          mAP0.5–0.95 (%)   mAP0.5 (%)   mAP0.75 (%)
D-ViT (O)      30.1              50.9         29.3
SFA-ViT (O)    34.2              55.9         37.2
SFA-ViT (A)    34.7              56.8         37.6
Table 6. Comparison results of different models in object detection task.
Model                mAP0.5–0.95 (%)   mAP0.5 (%)   mAP0.75 (%)
Sparse R-CNN [45]    23.8              33.2         21.1
FPN [46]             28.1              54.9         23.3
TOOD [47]            28.7              43.9         26.5
PAA [48]             25.2              42.7         26.1
YOLOX [49]           31.1              52.1         32.3
Dynamic R-CNN [50]   29.4              50.7         30.3
SFA-ViT (ours)       34.7              56.8         37.6
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
