1. Introduction
Agriculture is a key pillar of the global economy, providing essential food, income, and employment. Societies worldwide continuously adopt advanced technologies in agriculture to increase food production and meet the growing demands of an expanding population [
1]. Plant diseases, caused by fungi, viruses, protozoa, and bacteria [
2,
3], pose significant risks. Mismanagement or delayed response to these diseases can result in outbreaks or pandemics, severely affecting plant health, structural integrity, yield, and economic returns. A critical challenge in plant protection is the timely identification of disease symptoms [
4], which is crucial for implementing effective control measures. Early detection forms the foundation of successful plant disease management and is integral to agricultural decision-making. Recently, the identification of plant diseases has gained increasing importance.
Plant leaves are key indicators of plant health, as early symptoms of disease often appear on the leaves [
5]. Detecting damage at an early stage is vital to prevent the spread of the disease to other parts of the plant. Visual assessment of leaves plays an essential role in the early detection of plant diseases and the prevention of crop yield losses. In many regions, particularly in developing countries, traditional methods for classifying plant diseases are still in use. However, these methods, often reliant on farmers’ experience, are time-consuming, labor-intensive, and inefficient [
6,
7]. They are also prone to misjudgment during the identification process. Since different diseases require different treatments, inaccurate disease identification can lead to the application of the wrong chemicals. Without proper guidance on the use of agricultural chemicals (such as fertilizers, pesticides, and herbicides), overuse may occur, resulting in environmental pollution and economic losses. During outbreaks, agricultural experts and botanists must travel to affected areas to provide guidance, which requires significant manpower and time [
8]. As a result, automated plant disease recognition has emerged as a critical research area, offering significant benefits for large-scale crop monitoring and the early detection of leaf-based disease symptoms [
9,
10,
11].
In earlier years, machine learning techniques were employed to classify plant diseases, addressing the lack of human expertise. Many of these studies utilized image recognition to categorize images into healthy and diseased categories using specific classifiers. The major classification techniques widely used in previous studies for plant disease recognition include K-nearest neighbor (KNN) [
12], support vector machine (SVM) [
13], artificial neural network (ANN) [
14], and random forest (RF) [
15]. Although machine learning has made significant advancements in plant disease classification, there are still some limitations and shortcomings. Most machine learning algorithms are developed in laboratory environments. For certain types of diseases, manual observation by personnel with extensive agricultural knowledge is required. Additionally, these methods are often limited to small datasets [
16].
Deep learning (DL) methods have seen increasing application in plant disease recognition, driven by the availability of large datasets and advanced computational resources. Early research primarily focused on evaluating standard CNN architectures to determine the most effective models for this task. Mohanty et al. [
17] employed the PlantVillage dataset, comprising 38 classes, alongside classical network models AlexNet and GoogLeNet. They utilized transfer learning and initialization training techniques to classify plant disease images, with GoogLeNet achieving a remarkable accuracy of 99.35% through transfer learning. Sagar et al. [
18] leveraged pretrained models, including InceptionV3, InceptionResNetV2, ResNet50, MobileNet, and Densenet169, to classify plant disease images from the PlantVillage dataset, fine-tuning the final layer of each network model. Among these, ResNet50 demonstrated the highest accuracy at 98.2%.
Beyond standard CNN architectures, several specialized models have been developed for plant disease recognition. Mohanty, Hughes, and Marcel [
17] trained a deep learning model capable of recognizing 14 crop species and 26 crop diseases, achieving an accuracy of 99.35%. Ma et al. [
19] employed a deep CNN to identify symptoms of four cucumber diseases—downy mildew, anthracnose, powdery mildew, and target leaf spot—achieving an accuracy of 93.4%. Kawasaki et al. [
20] introduced a CNN-based system for cucumber leaf disease recognition, which attained an accuracy of 94.9%. Gokulnath et al. [
21] developed a CNN model with only three convolutional layers specifically for detecting and recognizing diseases in tomato and potato species. Additionally, Keceli et al. [
22] created a multitask learning model by integrating a custom CNN model with a pretrained AlexNet model, aiming to predict plant species and their associated diseases. Their model was built using images of tomato, potato, pepper, and corn species from the PlantVillage dataset.
Some researchers have integrated attention mechanisms with CNN models, demonstrating excellent performance in plant disease recognition [
23]. Pandey and Jain [
24] recently proposed a CNN model based on attention-intensive learning blocks. Li et al. [
25] proposed combining multiple dilated convolution and convolutional block attention modules with the DenseNet architecture for disease classification. Despite the promising results reported in the literature, convolutional neural networks (CNNs) have inherent limitations. The convolutional layer of a CNN considers only local region features during the convolution process and does not explicitly incorporate pixel location information. This limitation can reduce the effectiveness of plant disease recognition models.
The introduction of the vision transformer (ViT) [
26] has revolutionized the field of computer vision [
26,
27]. ViT has shown remarkable classification performance on several benchmark datasets, such as Stanford Cars, ImageNet, CIFAR-10, CIFAR-100, Flowers-102, and so on. Motivated by its remarkable performance, researchers have investigated the application of ViT in plant disease detection modeling [
28,
29,
30,
31]. ViT is particularly effective in capturing global feature dependencies. For the purpose of detecting plant diseases, Li et al. suggested a model that included 12 transformer blocks after an initial convolutional layer, achieving 96.71% accuracy on the PlantVillage dataset.
Despite these advancements, plant disease recognition continues to present significant challenges due to the vast number of disease species and the wide variety of crops. The task is made more difficult by the commonality of disease symptoms and the dynamic nature of disease patterns. Large-parameter deep learning models necessitate extensive datasets, while lightweight models often struggle with generalization. Furthermore, deep learning models’ limited capacity to generalize is a result of the lack of field data available for training [
32]. Additionally, most current methods for plant disease identification do not adequately address the interpretability of their results.
3. Materials and Methods
The model consists of
N layers of an encoder–decoder, where each layer accepts two inputs and produces two outputs. The final layer generates tokens that are fed into the classification layer to produce output probabilities. First, a brief description of the vision transformer (ViT) [
26] is provided.
3.1. Vision Transformer
The transformer architecture [
26,
33] was initially introduced for Natural Language Processing (NLP) tasks, where it delivered substantial advancements. Building on the success of transformers in NLP, researchers extended its application to computer vision tasks. The vision transformer (ViT) is a transformer-based model specifically designed for image classification, directly applied to sequences of image patches.
In ViT, a given image $x \in \mathbb{R}^{h \times w \times c}$ is split into a sequence of flattened patches $x_p \in \mathbb{R}^{n \times (p_1 p_2 c)}$, where $c$ is the number of channels, $(h, w)$ is the original image size, $(p_1, p_2)$ is the resolution of each patch, and $n = hw/(p_1 p_2)$ is the number of patches. Each patch is then linearly projected into the patch embedding space to generate a sequence of patch embeddings $z_p \in \mathbb{R}^{n \times d}$, where $d$ is the model dimension. To provide position information, learnable position embeddings $E_{pos} \in \mathbb{R}^{n \times d}$ are added to the patch embedding sequence. Finally, a learnable class embedding $x_{class}$ (the [CLS] token) is prepended to the patch-plus-position embedding sequence to obtain the token input sequence for the ViT backbone.
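For concreteness, this patch and token bookkeeping can be sketched as follows (an illustrative helper, not part of the original model code; the 224 × 224 image with 16 × 16 patches is the common ViT default):

```python
# Minimal sketch of ViT patch bookkeeping (illustrative only).
# For an h x w x c image split into p1 x p2 patches, each flattened patch has
# length p1 * p2 * c and there are n = (h * w) / (p1 * p2) patches; prepending
# the learnable [CLS] token gives a sequence of n + 1 tokens.

def vit_token_shapes(h, w, c, p1, p2):
    assert h % p1 == 0 and w % p2 == 0, "image must tile exactly into patches"
    n = (h // p1) * (w // p2)          # number of patches
    flat_dim = p1 * p2 * c             # length of each flattened patch
    seq_len = n + 1                    # patches plus the [CLS] token
    return n, flat_dim, seq_len

# A 224 x 224 RGB image with 16 x 16 patches yields 196 patches of length 768,
# i.e. a 197-token input sequence after the [CLS] token is prepended.
print(vit_token_shapes(224, 224, 3, 16, 16))  # (196, 768, 197)
```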
The ViT layer consists of multi-head self-attention (MSA) blocks [
26,
34] and feedforward network (FFN) blocks, applying layer norm (LN) before each block and adding a residual connection after each block:
$$z'_{\ell} = \mathrm{MSA}(\mathrm{LN}(z_{\ell-1})) + z_{\ell-1}, \qquad z_{\ell} = \mathrm{FFN}(\mathrm{LN}(z'_{\ell})) + z'_{\ell},$$
where $\ell = 1, \dots, L$ indexes the layers.
The self-attention mechanism consists of three linear layers that map the tokens to three intermediate representations: query $Q$, key $K$, and value $V$. The simplified computational flowchart is shown in
Figure 1b. Self-attention is computed as
$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{T}}{\sqrt{d}}\right)V.$$
The FFN block is applied after the multi-head self-attention block. It consists of two linear transformations and a nonlinear activation function, and can be expressed as
$$\mathrm{FFN}(x) = \sigma(xW_1 + b_1)W_2 + b_2,$$
where $W_1$ and $W_2$ are the learnable matrices of the two linear transformations and $\sigma$ denotes the activation function.
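A minimal single-head sketch of this pre-norm encoder block follows (illustrative only: the dimensions, random initialization, and ReLU activation are stand-ins, and multi-head splitting is omitted):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def layer_norm(x, eps=1e-5):
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def self_attention(z, Wq, Wk, Wv):
    # Map tokens to query, key, value and apply scaled dot-product attention.
    q, k, v = z @ Wq, z @ Wk, z @ Wv
    d = q.shape[-1]
    attn = softmax(q @ k.T / np.sqrt(d))   # each row sums to 1
    return attn @ v

def encoder_layer(z, p):
    # Pre-norm block: LN before each sub-layer, residual connection after.
    z = z + self_attention(layer_norm(z), p["Wq"], p["Wk"], p["Wv"])
    h = layer_norm(z)
    # FFN: two linear maps around a nonlinearity (ReLU here for simplicity;
    # ViT itself uses GELU).
    ffn = np.maximum(0.0, h @ p["W1"] + p["b1"]) @ p["W2"] + p["b2"]
    return z + ffn

rng = np.random.default_rng(0)
d, n = 16, 5                                 # model dim, number of tokens
p = {
    "Wq": rng.normal(size=(d, d)) * 0.1, "Wk": rng.normal(size=(d, d)) * 0.1,
    "Wv": rng.normal(size=(d, d)) * 0.1,
    "W1": rng.normal(size=(d, 4 * d)) * 0.1, "b1": np.zeros(4 * d),
    "W2": rng.normal(size=(4 * d, d)) * 0.1, "b2": np.zeros(d),
}
z = rng.normal(size=(n, d))
out = encoder_layer(z, p)
print(out.shape)  # (5, 16)
```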
3.2. Flaws of the ViT Architecture
The ViT architecture model leverages self-attention to learn the relationships between tokens, which can be broken down into three parts. First, the attention between patch tokens and position tokens is used to represent the features of the image. Second, the attention of the [CLS] tokens [
35] to both patch tokens and position tokens reflects the importance of different features for classification. The third part is the attention of patch and position tokens to the [CLS] tokens. In summary, the ViT architecture combines self-attention between patch/position tokens and cross-attention between class tokens and patch/position tokens. However, this last part is often unnecessary or even detrimental, as it introduces irrelevant information and increases the model’s complexity [
36].
We propose separating [CLS] tokens from patch/position tokens by using the original self-attention layer to capture the attention between patch tokens and position tokens. Additionally, we introduce a new cross-attention layer to extract the attention of [CLS] tokens to patch tokens and position tokens.
3.3. Cross-Attention Mechanism and Dot Product Scale
In the cross-attention mechanism, two input sequences of the same dimension are asymmetrically combined [37], with one sequence serving as the query $Q$ input and the other as the key $K$ and value $V$ inputs. The simplified computational flowchart is shown in
Figure 1a. Cross-attention is computed as
$$\mathrm{CrossAttention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{T}}{\sqrt{d}}\right)V.$$
We found that scaling is required when performing dot product operations [38], which produce larger values as the key $K$ and query $Q$ dimension $d$ increases. This can cause the softmax function to produce extreme outputs (close to 0 or 1), leading to the vanishing gradient problem. The softmax function is defined as
$$y_i = \mathrm{softmax}(x_i) = \frac{e^{x_i}}{\sum_{j} e^{x_j}},$$
where $x_i$ represents the dot product result. Differentiating the softmax output with respect to the input $x_j$ gives
$$\frac{\partial y_i}{\partial x_j} = y_i\,(\delta_{ij} - y_j).$$
When $x_i$ becomes large, $e^{x_i}$ increases significantly, driving the softmax output for $x_i$ toward 1 while the outputs for the other inputs approach 0. In such cases, the gradient becomes very small (close to 0), resulting in the vanishing gradient problem. Therefore, scaling the dot product by $\sqrt{d}$ is essential: it normalizes the variance of the dot product to 1, preventing the numerical instability that leads to vanishing gradients.
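This scaling argument can be checked numerically. The sketch below (with illustrative dimensions) verifies that the dot-product variance grows with d and that dividing by √d keeps softmax out of its saturated regime:

```python
import numpy as np

# For q, k with i.i.d. N(0, 1) entries, the dot product q . k has variance
# close to d, while (q . k) / sqrt(d) has variance close to 1.
rng = np.random.default_rng(42)
d, trials = 512, 20000
q = rng.normal(size=(trials, d))
k = rng.normal(size=(trials, d))
dots = np.einsum("ij,ij->i", q, k)

print(dots.var())                  # ~ d = 512
print((dots / np.sqrt(d)).var())   # ~ 1

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# A logit gap of ~30 is typical for unscaled d = 512 dot products
# (std ~ sqrt(512) ~ 22.6); softmax then saturates and gradients vanish.
unscaled = softmax(np.array([0.0, 30.0]))
scaled = softmax(np.array([0.0, 30.0]) / np.sqrt(d))
print(unscaled.max())  # ~ 1.0 (saturated)
print(scaled.max())    # well below 1 (soft distribution)
```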
3.4. Model of Encoder–Decoder Architecture
We designed an encoder–decoder image transformer (EDIT) architecture, where the encoder structure is used for self-attention between patches, and the decoder structure extracts the attention of [CLS] tokens to patches and position tokens.
Our architecture differs from the traditional encoder–decoder architecture. In a traditional encoder–decoder architecture, the encoder takes source tokens as input and generates hidden representations of these tokens layer by layer, from low to high levels. The decoder takes the representation from the last layer of the encoder and generates hidden representations of the [CLS] tokens, also from low to high levels.
In our architecture, the encoder and decoder are compared layer by layer. Both the encoder and decoder in our model have the same number of layers, with layer
i of the decoder aligned and coordinated with layer
i of the encoder, as shown in
Figure 2. The hidden representations of the source tokens in layer
i are generated using the self-attention mechanism, while the hidden representations of the target tokens in layer
i are generated from the hidden representations of all the source tokens and the preceding target tokens in layer
i−1 using the cross-attention mechanism. Consequently, the decoder can extract information from the source markers layer by layer, starting from the lower representation layers and utilizing the finer-grained source information.
Our model is illustrated in
Figure 2. The upper part of
Figure 2 represents the entire architecture. Plant disease images are processed into smaller image patches through the patch embedding block, and then the Tokenizer block encodes these image patches, converting them into token sequences. The token sequences and the [CLS] token serve as two separate inputs to the backbone network. The backbone consists of N identical layers. Each layer accepts two inputs and generates two outputs, which are then passed as inputs to the next layer. The [CLS] token output from the final layer is fed into the Linear and softmax blocks to produce output probabilities. The lower part of
Figure 2 shows two aligned layers, namely the encoder layer and the decoder layer. The encoder layer is exactly the same as in the ViT model, consisting of two sub-layers. The first sub-layer is the multi-head self-attention mechanism, and the second sub-layer is a simple fully connected feedforward network, with layer normalization applied before each sub-layer. Residual connections are applied around both sub-layers. The encoder layer takes the disease token sequence as input and processes it through the two sub-layers, with the multi-head self-attention mechanism capturing relationships between the token sequences. The decoder layer uses a cross-attention mechanism, taking the output of its aligned encoder layer along with the output from the previous decoder layer as input. The cross-attention mechanism, through the [CLS] token, learns the features of the disease token sequence and is used to acquire the disease category features.
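The layer pairing described above can be sketched as follows. This is a simplified single-head illustration of the data flow only (layer norm, the feedforward sub-layers, and trained weights are omitted), not the authors' implementation:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    # Scaled dot-product attention; when q comes from one sequence and
    # k, v from another, this is the cross-attention of Section 3.3.
    d = q.shape[-1]
    return softmax(q @ k.T / np.sqrt(d)) @ v

def edit_layer(patch_tokens, cls_token, W):
    # Encoder: self-attention among patch tokens only (no [CLS] mixing).
    enc = patch_tokens + attention(
        patch_tokens @ W["q"], patch_tokens @ W["k"], patch_tokens @ W["v"])
    # Decoder: the [CLS] token queries this layer's aligned encoder output.
    dec = cls_token + attention(
        cls_token @ W["q2"], enc @ W["k2"], enc @ W["v2"])
    return enc, dec  # two outputs feed the next aligned layer pair

rng = np.random.default_rng(0)
d, n, layers = 192, 196, 4           # hidden dim 192 as in Section 4.1
W = {name: rng.normal(size=(d, d)) * 0.05
     for name in ("q", "k", "v", "q2", "k2", "v2")}
patches = rng.normal(size=(n, d))
cls = rng.normal(size=(1, d))
for _ in range(layers):              # N aligned encoder-decoder layers
    patches, cls = edit_layer(patches, cls, W)
print(patches.shape, cls.shape)  # (196, 192) (1, 192)
```

The [CLS] output of the last layer would then go to the linear classification head.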
3.5. Data Acquisition and Preprocessing
Three publicly available datasets were used in the study. These datasets were collected from different environments with the aim of training the proposed method on various crops and their diseases and evaluating its performance in diverse test scenarios (see
Table 1 and
Figure 3).
A total of 54,305 images of 14 distinct plant species spread across 38 categories—12 healthy and 26 diseased—are contained in the PlantVillage collection [
39]. This dataset was created by the International Institute of Tropical Agriculture and the Penn State College of Agricultural Sciences. It is a useful tool for research and the development of computer-vision-based plant disease detection systems. The images are sourced from diverse environments, including research institutions and contributions from citizen scientists, covering a broad spectrum of plant species and disease types; examples include apple scab and black rot, potato early blight, and tomato leaf curl and leaf spot.
The FGVC8 dataset contains 18,632 high-quality, uniformly sized RGB images belonging to 12 categories. It comprises real field scenes with non-uniform backgrounds, captured at different times of day and leaf maturation stages with varying focal-length camera settings.
The EMBRAPA dataset [
40] contains 46,376 images in 93 categories, each subdivided according to specific criteria. The background was manually removed from all images prior to segmentation, and new images resulting from segmentation were created with healthy tissue comprising at least 20% of the total area to ensure a sharp contrast with diseased tissue. The majority of diseases represented in the images were associated with fungi (77%), followed by viruses (8%), pests (6%), bacteria (3%), phytotoxicity (2%), algae (2%), nutrient deficiencies (1%), and senescence (1%) [
40].
Since the images from the three datasets were collected in different environments, preprocessing is required after obtaining the leaf images. First, each dataset is divided into training and validation sets in an 8:2 ratio, as shown in
Table 1. Then, the resolution of all images is standardized to 256 × 256 pixels, followed by cropping to 224 × 224 pixels using the CenterCrop method from the torchvision package (0.19.1) to ensure uniform image dimensions. The cropped images are then normalized to reduce the impact of image quality variations on the model’s generalization performance. During the training phase, data augmentation techniques such as ColorJitter and RandAugment are employed to increase the diversity of the training data while preserving the overall image characteristics, thereby enhancing the model’s robustness and ensuring consistency and effective training across all data samples.
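The resize-and-crop step trims a 16-pixel border on each side; a minimal sketch of the center-crop geometry (illustrative arithmetic, not the torchvision source):

```python
def center_crop_box(height, width, crop_h, crop_w):
    # Compute the crop box used by a CenterCrop-style transform:
    # the crop window is centered, so equal borders are trimmed.
    top = (height - crop_h) // 2
    left = (width - crop_w) // 2
    return top, left, top + crop_h, left + crop_w

# Resizing to 256 x 256 and center-cropping to 224 x 224 keeps the
# central window and trims a 16-pixel border on every side.
print(center_crop_box(256, 256, 224, 224))  # (16, 16, 240, 240)
```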
3.6. Evaluation Metrics
Each model used in the comparison was evaluated based on standard classification metrics. In our experiments, accuracy, precision, recall, and F1-score were employed as evaluation indicators to provide a comprehensive assessment of model performance. The definitions of TP, TN, FP, and FN are outlined as follows:
True Positive (TP): The actual class of the sample is positive, and the model correctly identifies it as positive.
False Negative (FN): The actual class of the sample is positive, but the model incorrectly identifies it as negative.
False Positive (FP): The actual class of the sample is negative, but the model incorrectly identifies it as positive.
True Negative (TN): The actual class of the sample is negative, and the model correctly identifies it as negative.
Accuracy: In image classification, accuracy is a common performance metric. It is the number of samples correctly recognized by the model divided by the total number of samples. The higher the accuracy, the better the model's performance.
Precision: It indicates the percentage of samples that are truly positive classes out of the samples that are recognized as positive classes by the model. It lies between 0 and 1.
Recall: The ratio of the number of samples correctly identified as positive class by the model to the total number of positive class samples. The higher the Recall, the more positive class samples are correctly predicted by the model and the better the model performs.
F1-Score: The harmonic mean of precision and recall.
Specificity: The ratio of the number of negative-class samples correctly identified as negative by the model to the total number of negative samples.
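These metrics follow directly from the four counts; a minimal sketch for the binary case (in the multi-class setting, the per-class values are averaged):

```python
def classification_metrics(tp, fp, fn, tn):
    # Standard classification metrics from the confusion counts.
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)                              # a.k.a. sensitivity
    f1 = 2 * precision * recall / (precision + recall)   # harmonic mean
    specificity = tn / (tn + fp)
    return accuracy, precision, recall, f1, specificity

# Example: 40 true positives, 10 false positives, 20 false negatives,
# 30 true negatives.
acc, prec, rec, f1, spec = classification_metrics(40, 10, 20, 30)
print(round(acc, 3), round(prec, 3), round(rec, 3), round(f1, 3), round(spec, 3))
# 0.7 0.8 0.667 0.727 0.75
```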
In subsequent sections, we also use the confusion matrix to evaluate the performance of the model. In addition, we evaluate the interpretability of the model predictions using gradient-weighted class activation maps (Grad-CAMs) [
41].
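The core Grad-CAM computation weights each feature map by its spatially pooled gradient and keeps only the positive evidence; a minimal sketch with random stand-ins for the real activations and gradients:

```python
import numpy as np

def grad_cam(activations, gradients):
    # activations, gradients: (channels, h, w) arrays from the chosen conv
    # layer, with gradients taken w.r.t. the score of the target class.
    weights = gradients.mean(axis=(1, 2))             # alpha_k: pooled gradients
    cam = np.tensordot(weights, activations, axes=1)  # weighted sum of maps
    cam = np.maximum(cam, 0.0)                        # ReLU: keep positive evidence
    if cam.max() > 0:
        cam = cam / cam.max()                         # normalize to [0, 1]
    return cam

rng = np.random.default_rng(1)
acts = rng.normal(size=(8, 14, 14))   # stand-in feature maps
grads = rng.normal(size=(8, 14, 14))  # stand-in gradients
heatmap = grad_cam(acts, grads)
print(heatmap.shape)  # (14, 14), upsampled to image size for visualization
```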
3.7. Optimization Method and Loss Function
To avoid overfitting, we use the AdamW optimizer and label smoothing on the cross-entropy loss. The selection of the optimizer is detailed in
Section 4.2.
The AdamW algorithm is an extension of Adam that decouples weight decay from the gradient-based update, addressing the issue that L2 regularization and weight decay are not equivalent in Adam. The AdamW optimization algorithm offers several advantages. The decoupled decay term helps control the size of the weights while maintaining gradient stability. Additionally, it effectively reduces the risk of overfitting, thereby improving the model's generalization ability.
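The decoupled update can be illustrated for a single scalar parameter (a simplified sketch of one AdamW step, not the PyTorch implementation):

```python
import math

def adamw_step(w, g, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
               eps=1e-8, weight_decay=1e-4):
    # One AdamW step on a scalar parameter: the Adam moment update is
    # unchanged, but weight decay is applied directly to w (decoupled),
    # not folded into the gradient as plain L2 regularization would be.
    m = beta1 * m + (1 - beta1) * g          # first moment
    v = beta2 * v + (1 - beta2) * g * g      # second moment
    m_hat = m / (1 - beta1 ** t)             # bias correction
    v_hat = v / (1 - beta2 ** t)
    w = w - lr * m_hat / (math.sqrt(v_hat) + eps) - lr * weight_decay * w
    return w, m, v

# With zero gradient, only the decoupled decay term moves the weight:
w, m, v = adamw_step(1.0, 0.0, 0.0, 0.0, t=1, lr=0.1, weight_decay=0.01)
print(w)  # 0.999
```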
Label smoothing mitigates the overconfidence encouraged by softmax and makes the model assign some weight to low-probability classes. Unlike the standard cross-entropy loss function, label-smoothing cross-entropy involves every term in the calculation. The label-smoothing cross-entropy function is defined in Equation (15):
$$y_i^{LS} = (1 - \varepsilon)\,\delta_{i,\mathrm{class}} + \frac{\varepsilon}{n}, \qquad L = -\sum_{i=1}^{n} y_i^{LS} \log q_i,$$
where $\varepsilon$ denotes the degree of smoothing,
$n$ denotes the number of categories, and class denotes the current category.
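A minimal sketch of the label-smoothing cross-entropy for a single sample (pure Python, illustrative; in practice the PyTorch implementation is used):

```python
import math

def label_smoothing_ce(probs, target, eps, n):
    # Smoothed target distribution: (1 - eps) on the true class plus a
    # uniform eps / n spread over all n classes; the loss is the
    # cross-entropy against this softened distribution, so every predicted
    # probability contributes to the loss.
    loss = 0.0
    for i, q in enumerate(probs):
        y = (1.0 - eps) * (1.0 if i == target else 0.0) + eps / n
        loss -= y * math.log(q)
    return loss

probs = [0.7, 0.2, 0.1]
# With eps = 0 this reduces to the standard cross-entropy -log(q_class).
print(abs(label_smoothing_ce(probs, 0, 0.0, 3) - (-math.log(0.7))) < 1e-12)  # True
# With smoothing, low-probability classes also contribute to the loss.
print(label_smoothing_ce(probs, 0, 0.1, 3) > -math.log(0.7))  # True
```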
3.8. Transfer Learning
Transfer learning [42] applies knowledge gained from a model trained on one problem to a different but related task or domain [43]. In deep learning, this entails reusing weights from the layers of a pretrained network in a new model; these weights can be frozen, fine-tuned, or fully retrained for the new task. The early layers of a pretrained network capture general features, while during fine-tuning the final layers can be replaced and retrained for the target task. Although fine-tuning requires some training, it is still much faster than training from scratch and typically achieves higher accuracy than models trained entirely from scratch. This approach also enables deep neural networks to be trained with relatively little data.
In this paper, we use our own pretrained models learned on the ImageNet dataset and then transfer them to specific tasks trained on the target dataset.
4. Experimental Results and Analysis
4.1. Experimental Setup
All our experiments were performed on the AutoDL private cloud platform, using a single NVIDIA GeForce RTX 4090 GPU (NVIDIA, Santa Clara, CA, USA) with 24 GB of memory. The models were implemented in Python using the PyTorch (2.2.1) deep learning framework and the Timm library for training and evaluation. The software and hardware configurations are shown in
Table 2.
The model hyperparameters were adjusted slightly for each dataset, including learning rate, batch size, weight decay, and attention scaling. We utilized our pretrained models for transfer learning. For the PlantVillage dataset, we used a batch size of 64, a learning rate of 1 × 10−3, and attention scaling reduced by a factor of 4. For the FGVC8 dataset, the batch size was 128, with a learning rate of 5 × 10−4 and a four-fold reduction in attention scaling. For the EMBRAPA dataset, we used a batch size of 128, a learning rate of 1 × 10−3, and normal attention scaling. The weight decay was consistently set at 1 × 10−4, and the hidden dimension was 192.
4.2. Optimization Algorithm Comparison Experiment
To verify the effectiveness of the AdamW optimizer, we conducted comparison tests using our model. In these tests, we only changed the type of optimization algorithm while keeping other parameters constant. The optimization algorithms compared in the experiments were SGD, Adam, and AdamW. The results of the experiments are shown in
Table 3.
The AdamW optimization algorithm, a variant of the Adam optimizer, improves the weight decay calculation by incorporating both the adaptive gradient and momentum gradient mechanisms. Compared to the SGD optimization algorithm, AdamW demonstrates reduced sensitivity to the learning rate [
44]. As illustrated in
Figure 4, after 50 iterations of training, the training loss of the EDIT model with these three optimization algorithms stabilizes. The loss value of the model using the SGD optimization algorithm decreases slowly and exhibits poor convergence. In contrast, models employing the Adam and AdamW optimization algorithms show a more rapid decline in loss values, achieving lower and more consistent loss values. Furthermore, the model utilizing the AdamW optimization algorithm outperforms the Adam algorithm in terms of both loss reduction and convergence speed.
From the experimental results in
Table 3, it is evident that the model using the SGD optimization algorithm performs poorly on all three datasets, exhibiting lower accuracy. The models using the Adam and AdamW optimization algorithms produced similar results across the three datasets. However, the AdamW optimization algorithm consistently outperformed the Adam algorithm. These experimental results demonstrate that the AdamW optimization algorithm offers superior performance in deep learning model training compared to optimization algorithms such as SGD, primarily by enhancing the regularization method.
4.3. Performance Evaluation
We used the confusion matrix to measure the classification accuracy of the EDIT network model and assess its effectiveness in plant disease classification. The confusion matrix is a popular machine learning visualization tool that compares the predicted class labels with the actual class labels for each instance, providing an overview of the model's performance on the dataset and showing which classes the model confuses with one another. From the confusion matrix, metrics such as accuracy, precision, recall, and F1-score can be constructed, providing a thorough assessment of the model's classification performance.
Figure 5 displays the confusion matrix findings that were obtained by applying our model to the FGVC8 dataset validation.
Table 4 reports the corresponding evaluation metrics. The confusion matrices and evaluation metrics for the other two datasets (PlantVillage and EMBRAPA) are not provided due to the vast number of categories they contain. On the FGVC8 dataset, our proposed model shows reasonably good classification performance; as can be seen in
Figure 5, the classifications of Healthy, Rust, and Scab are very accurate, and frog eye leaf spot is also classified largely correctly. However, some misclassifications remain, concentrated in the remaining disease categories. Most of the misclassified samples share similar phenotypic characteristics, such as spots. Moreover, the complex leaf disease category contains many distinct disease types that are difficult to distinguish from those in other categories because of their similarities. These are major obstacles that affect all models equally, not just the one we propose.
The training and validation accuracy graphs for all three datasets along with the evaluation metrics are presented in
Figure 6 and
Table 5. The curves for the PlantVillage dataset are more convergent, while the training and validation results for the FGVC8 and EMBRAPA datasets show greater variation. Since phenotypic features are often similar across several types of diseases, samples in the FGVC8 dataset with numerous diseases are misclassified. The discrepancy in the EMBRAPA dataset may be due to a high degree of sample imbalance, with only a small number of samples for individual diseases.
4.4. Selection of Attention Scaling Factor
In our model, cross-attention is one of the central components, and the dot product is used to compute the relationship between the key and the query. In practice, the dot product is scaled by the square root of the key dimension d. In this section, we discuss the impact of this scaling on the experimental results. We introduce a scaling factor applied to this term and, keeping all other parameters constant, vary only this factor in the experiments.
The experimental results presented in
Table 6 demonstrate that the size of the scaling factor significantly influences the model's performance. As the scaling factor varies, key metrics such as accuracy, validation loss, and precision fluctuate across the different datasets.
In the PlantVillage dataset, decreasing the scaling factor leaves accuracy relatively stable, consistently ranging between 99.59% and 99.89%, indicating a high degree of stability. In contrast, the FGVC8 dataset shows a more pronounced sensitivity to the scaling factor: as it increases, accuracy progressively improves, peaking at 91.52%, and validation loss is minimized at the same setting, suggesting that this scaling factor more effectively captures the relevant data features. In the EMBRAPA dataset, the highest accuracy is also observed at that setting; however, the validation loss is lower at a smaller scaling factor. Although accuracy improves marginally, the increased validation loss indicates that a larger scaling factor complicates model optimization. Furthermore, when the scaling factor becomes too large, all three datasets experience a significant drop in accuracy, likely due to exploding or vanishing gradients.
As illustrated in
Figure 7, controlling the scaling factor to regulate gradient magnitude and ensure stability is essential. Unstable gradients result in erratic training. The experimental findings indicate that the impact of the scaling factor on model performance varies across datasets. Therefore, in practical applications, selecting the appropriate scaling factor should be guided by the specific characteristics and size of the dataset to achieve optimal results.
4.5. Performance Comparison with Other Models
We designed an experiment to evaluate the proposed model's performance against other models for recognizing plant diseases. All models use the same experimental methodology and environment and are trained and validated on the three datasets: PlantVillage, EMBRAPA, and FGVC8.
Four of the compared models are CNNs, and we also evaluate a vision transformer model. First, the weights of each model are initialized via transfer learning, using frozen weights pretrained on the ImageNet dataset.
Table 7 gives the mentioned performance metrics comparing the quantitative performance results of all five models and our proposed model on three different datasets.
As shown in
Figure 8, in the PlantVillage dataset our proposed model achieved the highest accuracy of 99.89%, significantly outperforming classical models such as ResNet, ViT-S, and InceptionV3 and demonstrating its strong generalization ability on large datasets. Our model also recorded the lowest validation loss on this dataset, indicating a better fit to the data. As shown in
Figure 9, performance varied considerably across models on the FGVC8 dataset. Our model achieved the highest accuracy of 91.52%, outperforming the other models. In contrast, ViT-S performed poorly on this dataset, showing weak fitting capability and suggesting a limited ability to handle real-world background data. As shown in
Figure 10, in the EMBRAPA dataset our proposed model achieved an accuracy of 97.42%, significantly surpassing the other models. Comparing accuracy, validation loss, precision, recall, F1-score, and specificity across models, our model consistently delivered superior classification results, demonstrating its efficiency in identifying plant diseases across various scenarios.
According to the performance comparison with the other models, our proposed model extracts the distinguishing features of each disease more effectively.
As shown in Figure 11, Grad-CAM visualizations further illustrate these cases, using samples from the validation set of each of the three datasets. Notably, on the PlantVillage dataset most models correctly localized the full diseased leaf with considerable accuracy, whereas the ViT-S model could hardly identify the correct disease area on any of the three datasets. In addition, the SEnet and InceptionV3 models localized the diseased portion of the leaf very well on the EMBRAPA dataset.
As shown in Figure 12, although the MobilenetV3 model's recognition accuracy is not the best, its parameter count and GFLOPs are very small. Comparing all of the models, it is evident that the proposed model locates diseased plant leaf areas clearly and progressively improves its capture of disease traits within the targeted region. We also find that the proposed model recognizes plant leaf disease regions well against both laboratory backgrounds and real-world scenes, further illustrating its practical feasibility.
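The Grad-CAM heatmaps discussed above follow a simple recipe: global-average-pool the gradient of the class score over each feature map to get a channel weight, take the weighted sum of the feature maps, and apply ReLU. The sketch below is a minimal pure-Python illustration with tiny hypothetical activations and gradients, not the actual implementation used in the experiments:

```python
def grad_cam(activations, gradients):
    """Grad-CAM over one conv layer: weight each feature map by the
    spatial mean of its gradient, sum over channels, then ReLU.
    activations, gradients: [K][H][W] nested lists."""
    K = len(activations)
    H, W = len(activations[0]), len(activations[0][0])
    # alpha_k: channel importance = mean gradient over spatial positions
    alphas = [sum(sum(row) for row in gradients[k]) / (H * W) for k in range(K)]
    cam = [[0.0] * W for _ in range(H)]
    for k in range(K):
        for i in range(H):
            for j in range(W):
                cam[i][j] += alphas[k] * activations[k][i][j]
    # ReLU keeps only features with a positive influence on the class
    return [[max(0.0, v) for v in row] for row in cam]

# Toy example: 2 channels with 2x2 spatial maps
acts = [[[1.0, 0.0], [0.0, 1.0]],
        [[0.0, 2.0], [2.0, 0.0]]]
grads = [[[0.4, 0.4], [0.4, 0.4]],        # alpha_0 = 0.4
         [[-0.2, -0.2], [-0.2, -0.2]]]    # alpha_1 = -0.2
heatmap = grad_cam(acts, grads)
```

In practice the resulting low-resolution map is upsampled to the input image size and overlaid on the leaf image, which is how the visualizations in Figure 11 are produced.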
5. Discussion
The proposed model aims to advance AI-based solutions for plant disease identification and monitoring. Fast and accurate models are essential for timely disease detection and intervention, addressing critical food security issues. However, due to the high similarity among different diseases, even human experts struggle with precise identification, making lightweight models inadequate for capturing intricate disease details and providing accurate predictions. Although increasing model depth can enhance feature extraction, it introduces challenges such as gradient vanishing, overfitting, and the growing computational burden associated with deeper architectures. Additionally, these deeper models, while powerful, are not lightweight enough, making them difficult to deploy on resource-constrained devices such as mobile platforms.
Training deep models from scratch also incurs significant computational costs. Transfer learning [
42] offers a viable solution to these issues by improving accuracy while reducing training time. Fine-tuned state-of-the-art deep learning models were analyzed alongside our proposed model for plant disease identification and compared in terms of GFLOPs, parameter size, and accuracy. The data in
Table 5 demonstrate that the deep network model ResNet50 outperforms the shallower ResNet18, but this improvement comes at the cost of increased model size and computational complexity. Similarly, our proposed transformer model achieves better results than the CNN model, yet it is not as lightweight as desired for real-time applications or deployment in low-resource environments.
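The parameter and GFLOPs figures underlying such comparisons follow directly from layer shapes. As a minimal sketch, the standard formulas for a single convolutional layer are shown below, with hypothetical layer sizes chosen only for illustration:

```python
def conv2d_cost(c_in, c_out, k, h_out, w_out, bias=True):
    """Parameter count and multiply-accumulate count of one k x k
    Conv2d layer mapping c_in channels to c_out channels."""
    params = c_out * (c_in * k * k + (1 if bias else 0))
    # one multiply-accumulate per kernel element per output position
    macs = c_out * c_in * k * k * h_out * w_out
    return params, macs

# e.g. a 3x3 conv, 64 -> 128 channels, on a 56x56 output feature map
params, macs = conv2d_cost(64, 128, 3, 56, 56)
```

Summing these quantities over every layer of a network yields the total parameter size and (after scaling by 1e9) the GFLOPs values reported in comparisons like Table 5, which is why deeper or wider models trade accuracy gains for deployment cost.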
Despite the use of transfer learning, the model still requires a large number of images for training to produce reliable results [
45]. However, the exact number of images needed remains undefined, and this remains a key challenge. Furthermore, each plant can exhibit multiple diseases, making accurate image labeling difficult. The presence of rare diseases further increases the risk of misrecognition.
Moreover, the diversity of image capture environments across different datasets introduces additional limitations in model development. A significant proportion of datasets focus on single-leaf images, whereas real-life scenarios often involve multiple diseased leaves with complex backgrounds. This discrepancy can impair the model’s performance in practical settings. The datasets used in this study include instances of the same plant with multiple diseases as well as multiple plants with a single disease, adding complexity to the training and evaluation process.
In this paper, we utilized three datasets and achieved the best results, demonstrating that the proposed model is more suitable than existing models for plant disease recognition applications in precision agriculture. However, challenges remain in plant disease recognition and detection. For instance, the high degree of similarity in the appearance and texture of different plant diseases, along with the subtlety of initial disease symptoms on plant leaves, poses significant obstacles to the development of effective AI solutions.
6. Conclusions
Plant diseases represent a significant challenge to global agricultural development, with the potential to cause devastating losses of food crops in severe cases. The automatic diagnosis of plant diseases in the context of agricultural informatization is therefore urgently needed. Plant disease identification using traditional approaches is frequently expensive and time-consuming, so the application of artificial intelligence is of paramount importance for early detection. This study examines the potential of deep learning and transfer learning in addressing the challenge of plant disease recognition and proposes a novel deep learning architecture. Detailed analysis of performance metrics such as accuracy, validation loss, precision, recall, F1-score, and specificity shows that the model delivers outstanding performance on three publicly available datasets with different scales and background conditions. The model's accuracy on the FGVC8, PlantVillage, and EMBRAPA datasets reached 91.5%, 99.9%, and 97.4%, respectively, and it outperformed six advanced deep learning models in plant disease recognition. Furthermore, the interpretability of the model's predictions was evaluated using the Grad-CAM method, which showed that the model's predictions are well interpretable.
Despite the excellent performance of the EDIT model in feature extraction and classification, it has not fully met expectations in addressing the challenges posed by complex backgrounds and uneven data distribution. In future research, we will focus on overcoming dataset-related challenges and aim to develop a real-time object recognition model. This model will be capable of effectively identifying targets in both simple and complex background scenarios, including real-world field conditions. Additionally, we plan to take a lightweight approach, deploying the optimized model in mobile applications to support more plant disease researchers and farmers. Ultimately, we aim to enhance agricultural productivity and promote increased food production.