1. Introduction
Agriculture is a key pillar of the global economy, providing essential food, income, and employment. Societies worldwide continuously adopt advanced technologies in agriculture to increase food production and meet the growing demands of an expanding population [
1]. Plant diseases, caused by fungi, viruses, protozoa, and bacteria [
2,
3], pose significant risks. Mismanagement or delayed response to these diseases can result in outbreaks or pandemics, severely affecting plant health, structural integrity, yield, and economic returns. A critical challenge in plant protection is the timely identification of disease symptoms [
4], which is crucial for implementing effective control measures. Early detection forms the foundation of successful plant disease management and is integral to agricultural decision-making. Recently, the identification of plant diseases has gained increasing importance.
Plant leaves are key indicators of plant health, as early symptoms of disease often appear on the leaves [
5]. Detecting damage at an early stage is vital to prevent the spread of the disease to other parts of the plant. Visual assessment of leaves plays an essential role in the early detection of plant diseases and the prevention of crop yield losses. In many regions, particularly in developing countries, traditional methods for classifying plant diseases are still in use. However, these methods, often reliant on farmers’ experience, are time-consuming, labor-intensive, and inefficient [
6,
7]. They are also prone to misjudgment during the identification process. Since different diseases require different treatments, inaccurate disease identification can lead to the application of the wrong chemicals. Without proper guidance on the use of agricultural chemicals (such as fertilizers, pesticides, and herbicides), overuse may occur, resulting in environmental pollution and economic losses. During outbreaks, agricultural experts and botanists must travel to affected areas to provide guidance, which requires significant manpower and time [
8]. As a result, automated plant disease recognition has emerged as a critical research area, offering significant benefits for large-scale crop monitoring and the early detection of leaf-based disease symptoms [
9,
10,
11].
In earlier years, machine learning techniques were employed to classify plant diseases, addressing the lack of human expertise. Many of these studies utilized image recognition to categorize images into healthy and diseased categories using specific classifiers. The major classification techniques widely used in previous studies for plant disease recognition include K-nearest neighbor (KNN) [
12], support vector machine (SVM) [
13], artificial neural network (ANN) [
14], and random forest (RF) [
15]. Although machine learning has made significant advancements in plant disease classification, there are still some limitations and shortcomings. Most machine learning algorithms are developed in laboratory environments. For certain types of diseases, manual observation by personnel with extensive agricultural knowledge is required. Additionally, these methods are often limited to small datasets [
16].
Deep learning (DL) methods have seen increasing application in plant disease recognition, driven by the availability of large datasets and advanced computational resources. Early research primarily focused on evaluating standard CNN architectures to determine the most effective models for this task. Mohanty et al. [
17] employed the PlantVillage dataset, comprising 38 classes, alongside classical network models AlexNet and GoogLeNet. They utilized transfer learning and initialization training techniques to classify plant disease images, with GoogLeNet achieving a remarkable accuracy of 99.35% through transfer learning. Sagar et al. [
18] leveraged pretrained models, including InceptionV3, InceptionResNetV2, ResNet50, MobileNet, and Densenet169, to classify plant disease images from the PlantVillage dataset, fine-tuning the final layer of each network model. Among these, ResNet50 demonstrated the highest accuracy at 98.2%.
Beyond standard CNN architectures, several specialized models have been developed for plant disease recognition. Mohanty, Hughes, and Marcel [
17] trained a deep learning model capable of recognizing 14 crop species and 26 crop diseases, achieving an accuracy of 99.35%. Ma et al. [
19] employed a deep CNN to identify symptoms of four cucumber diseases—downy mildew, anthracnose, powdery mildew, and target leaf spot—achieving an accuracy of 93.4%. Kawasaki et al. [
20] introduced a CNN-based system for cucumber leaf disease recognition, which attained an accuracy of 94.9%. Gokulnath et al. [
21] developed a CNN model with only three convolutional layers specifically for detecting and recognizing diseases in tomato and potato species. Additionally, Keceli et al. [
22] created a multitask learning model by integrating a custom CNN model with a pretrained AlexNet model, aiming to predict plant species and their associated diseases. Their model was built using images of tomato, potato, pepper, and corn species from the PlantVillage dataset.
Some researchers have integrated attention mechanisms with CNN models, demonstrating excellent performance in plant disease recognition [
23]. Pandey and Jain [
24] recently proposed a CNN model based on attention-intensive learning blocks. Li et al. [
25] proposed combining multiple dilated convolution and convolutional block attention modules with the DenseNet architecture for disease classification. Despite the promising results reported in the literature, convolutional neural networks (CNNs) have inherent limitations. The convolutional layer of a CNN considers only local region features during the convolution process and does not explicitly incorporate pixel location information. This limitation can reduce the effectiveness of plant disease recognition models.
The introduction of the vision transformer (ViT) [
26] has revolutionized the field of computer vision [
26,
27]. ViT has shown remarkable classification performance on several benchmark datasets, such as Stanford Cars, ImageNet, CIFAR-10, CIFAR-100, Flowers-102, and so on. Motivated by its remarkable performance, researchers have investigated the application of ViT in plant disease detection modeling [
28,
29,
30,
31]. ViT is particularly effective in capturing global feature dependencies. For the purpose of detecting plant diseases, Li et al. suggested a model that included 12 transformer blocks after an initial convolutional layer, achieving 96.71% accuracy on the PlantVillage dataset.
Despite these advancements, plant disease recognition continues to present significant challenges due to the vast number of disease species and the wide variety of crops. The task is made more difficult by the commonality of disease symptoms and the dynamic nature of disease patterns. Large-parameter deep learning models necessitate extensive datasets, while lightweight models often struggle with generalization. Furthermore, deep learning models’ limited capacity to generalize is a result of the lack of field data available for training [
32]. Additionally, most current methods for plant disease identification do not adequately address the interpretability of their results.
3. Materials and Methods
The model consists of
N layers of an encoder–decoder, where each layer accepts two inputs and produces two outputs. The final layer generates tokens that are fed into the classification layer to produce output probabilities. First, a brief description of the vision transformer (ViT) [
26] is provided.
3.1. Vision Transformer
The transformer architecture [
26,
33] was initially introduced for Natural Language Processing (NLP) tasks, where it delivered substantial advancements. Building on the success of transformers in NLP, researchers extended its application to computer vision tasks. The vision transformer (ViT) is a transformer-based model specifically designed for image classification, directly applied to sequences of image patches.
In ViT, a given image $x \in \mathbb{R}^{h \times w \times c}$ is split into a sequence of flattened patches $x_p \in \mathbb{R}^{n \times (p_1 p_2 c)}$, where $c$ is the number of channels, $(h, w)$ is the original image size, $(p_1, p_2)$ is the resolution of each patch, and $n = hw/(p_1 p_2)$ is the number of patches. Each patch is then linearly projected into the patch embedding space to generate a sequence of patch embeddings $z_p \in \mathbb{R}^{n \times d}$, where $d$ is the model dimension. To provide position information, learnable position embeddings $E_{pos} \in \mathbb{R}^{n \times d}$ are added to the patch embedding sequence. Finally, a learnable class embedding $x_{class}$ (the [CLS] token) is prepended to the patch-plus-position embedding sequence to obtain the token input sequence for the ViT backbone.
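For concreteness, this patch and token bookkeeping can be sketched as follows (an illustrative helper, not part of the original model code; the 224 × 224 image with 16 × 16 patches is the common ViT default):

```python
# Minimal sketch of ViT patch bookkeeping (illustrative only).
# For an h x w x c image split into p1 x p2 patches, each flattened patch has
# length p1 * p2 * c and there are n = (h * w) / (p1 * p2) patches; prepending
# the learnable [CLS] token gives a sequence of n + 1 tokens.

def vit_token_shapes(h, w, c, p1, p2):
    assert h % p1 == 0 and w % p2 == 0, "image must tile exactly into patches"
    n = (h // p1) * (w // p2)          # number of patches
    flat_dim = p1 * p2 * c             # length of each flattened patch
    seq_len = n + 1                    # patches plus the [CLS] token
    return n, flat_dim, seq_len

# A 224 x 224 RGB image with 16 x 16 patches yields 196 patches of length 768,
# i.e. a 197-token input sequence after the [CLS] token is prepended.
print(vit_token_shapes(224, 224, 3, 16, 16))  # (196, 768, 197)
```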
The ViT layer consists of multi-head self-attention (MSA) blocks [
26,
34] and feedforward network (FFN) blocks, applying layer norm (LN) before each block and adding a residual connection after each block:
$$z'_{\ell} = \mathrm{MSA}(\mathrm{LN}(z_{\ell-1})) + z_{\ell-1}, \qquad z_{\ell} = \mathrm{FFN}(\mathrm{LN}(z'_{\ell})) + z'_{\ell},$$
where $\ell = 1, \dots, L$ indexes the layers.
The self-attention mechanism consists of three linear layers that map the tokens to three intermediate representations: query $Q$, key $K$, and value $V$. The simplified computational flowchart is shown in
Figure 1b. Self-attention is computed as
$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{T}}{\sqrt{d}}\right)V.$$
The FFN block is applied after the multi-head self-attention block. It consists of two linear transformations and a nonlinear activation function, and can be expressed as
$$\mathrm{FFN}(x) = \sigma(xW_1 + b_1)W_2 + b_2,$$
where $W_1$ and $W_2$ are the learnable matrices of the two linear transformations and $\sigma$ denotes the activation function.
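A minimal single-head sketch of this pre-norm encoder block follows (illustrative only: the dimensions, random initialization, and ReLU activation are stand-ins, and multi-head splitting is omitted):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def layer_norm(x, eps=1e-5):
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def self_attention(z, Wq, Wk, Wv):
    # Map tokens to query, key, value and apply scaled dot-product attention.
    q, k, v = z @ Wq, z @ Wk, z @ Wv
    d = q.shape[-1]
    attn = softmax(q @ k.T / np.sqrt(d))   # each row sums to 1
    return attn @ v

def encoder_layer(z, p):
    # Pre-norm block: LN before each sub-layer, residual connection after.
    z = z + self_attention(layer_norm(z), p["Wq"], p["Wk"], p["Wv"])
    h = layer_norm(z)
    # FFN: two linear maps around a nonlinearity (ReLU here for simplicity;
    # ViT itself uses GELU).
    ffn = np.maximum(0.0, h @ p["W1"] + p["b1"]) @ p["W2"] + p["b2"]
    return z + ffn

rng = np.random.default_rng(0)
d, n = 16, 5                                 # model dim, number of tokens
p = {
    "Wq": rng.normal(size=(d, d)) * 0.1, "Wk": rng.normal(size=(d, d)) * 0.1,
    "Wv": rng.normal(size=(d, d)) * 0.1,
    "W1": rng.normal(size=(d, 4 * d)) * 0.1, "b1": np.zeros(4 * d),
    "W2": rng.normal(size=(4 * d, d)) * 0.1, "b2": np.zeros(d),
}
z = rng.normal(size=(n, d))
out = encoder_layer(z, p)
print(out.shape)  # (5, 16)
```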
3.2. Flaws of the ViT Architecture
The ViT architecture model leverages self-attention to learn the relationships between tokens, which can be broken down into three parts. First, the attention between patch tokens and position tokens is used to represent the features of the image. Second, the attention of the [CLS] tokens [
35] to both patch tokens and position tokens reflects the importance of different features for classification. The third part is the attention of patch and position tokens to the [CLS] tokens. In summary, the ViT architecture combines self-attention between patch/position tokens and cross-attention between class tokens and patch/position tokens. However, this last part is often unnecessary or even detrimental, as it introduces irrelevant information and increases the model’s complexity [
36].
We propose separating [CLS] tokens from patch/position tokens by using the original self-attention layer to capture the attention between patch tokens and position tokens. Additionally, we introduce a new cross-attention layer to extract the attention of [CLS] tokens to patch tokens and position tokens.
3.3. Cross-Attention Mechanism and Dot Product Scale
In the cross-attention mechanism, two input sequences of the same dimension are asymmetrically combined [37], with one sequence serving as the query $Q$ input and the other as the key $K$ and value $V$ inputs. The simplified computational flowchart is shown in
Figure 1a. Cross-attention is computed as
$$\mathrm{CrossAttention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{T}}{\sqrt{d}}\right)V.$$
We found that scaling is required when performing dot product operations [38], which produce larger values as the key $K$ and query $Q$ dimension $d$ increases. This can cause the softmax function to produce extreme outputs (close to 0 or 1), leading to the vanishing gradient problem. The softmax function is defined as
$$y_i = \mathrm{softmax}(x_i) = \frac{e^{x_i}}{\sum_{j} e^{x_j}},$$
where $x_i$ represents the dot product result. Differentiating the softmax output with respect to the input $x_j$ gives
$$\frac{\partial y_i}{\partial x_j} = y_i\,(\delta_{ij} - y_j).$$
When $x_i$ becomes large, $e^{x_i}$ increases significantly, driving the softmax output for $x_i$ toward 1 while the outputs for the other inputs approach 0. In such cases, the gradient becomes very small (close to 0), resulting in the vanishing gradient problem. Therefore, scaling the dot product by $\sqrt{d}$ is essential: it normalizes the variance of the dot product to 1, preventing the numerical instability that leads to vanishing gradients.
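This scaling argument can be checked numerically. The sketch below (with illustrative dimensions) verifies that the dot-product variance grows with d and that dividing by √d keeps softmax out of its saturated regime:

```python
import numpy as np

# For q, k with i.i.d. N(0, 1) entries, the dot product q . k has variance
# close to d, while (q . k) / sqrt(d) has variance close to 1.
rng = np.random.default_rng(42)
d, trials = 512, 20000
q = rng.normal(size=(trials, d))
k = rng.normal(size=(trials, d))
dots = np.einsum("ij,ij->i", q, k)

print(dots.var())                  # ~ d = 512
print((dots / np.sqrt(d)).var())   # ~ 1

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# A logit gap of ~30 is typical for unscaled d = 512 dot products
# (std ~ sqrt(512) ~ 22.6); softmax then saturates and gradients vanish.
unscaled = softmax(np.array([0.0, 30.0]))
scaled = softmax(np.array([0.0, 30.0]) / np.sqrt(d))
print(unscaled.max())  # ~ 1.0 (saturated)
print(scaled.max())    # well below 1 (soft distribution)
```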
3.4. Model of Encoder–Decoder Architecture
We designed an encoder–decoder image transformer (EDIT) architecture, where the encoder structure is used for self-attention between patches, and the decoder structure extracts the attention of [CLS] tokens to patches and position tokens.
Our architecture differs from the traditional encoder–decoder architecture. In a traditional encoder–decoder architecture, the encoder takes source tokens as input and generates hidden representations of these tokens layer by layer, from low to high levels. The decoder takes the representation from the last layer of the encoder and generates hidden representations of the [CLS] tokens, also from low to high levels.
In our architecture, the encoder and decoder are compared layer by layer. Both the encoder and decoder in our model have the same number of layers, with layer
i of the decoder aligned and coordinated with layer
i of the encoder, as shown in
Figure 2. The hidden representations of the source tokens in layer
i are generated using the self-attention mechanism, while the hidden representations of the target tokens in layer
i are generated from the hidden representations of all the source tokens and the preceding target tokens in layer
i−1 using the cross-attention mechanism. Consequently, the decoder can extract information from the source markers layer by layer, starting from the lower representation layers and utilizing the finer-grained source information.
Our model is illustrated in
Figure 2. The upper part of
Figure 2 represents the entire architecture. Plant disease images are processed into smaller image patches through the patch embedding block, and then the Tokenizer block encodes these image patches, converting them into token sequences. The token sequences and the [CLS] token serve as two separate inputs to the backbone network. The backbone consists of N identical layers. Each layer accepts two inputs and generates two outputs, which are then passed as inputs to the next layer. The [CLS] token output from the final layer is fed into the Linear and softmax blocks to produce output probabilities. The lower part of
Figure 2 shows two aligned layers, namely the encoder layer and the decoder layer. The encoder layer is exactly the same as in the ViT model, consisting of two sub-layers. The first sub-layer is the multi-head self-attention mechanism, and the second sub-layer is a simple fully connected feedforward network, with layer normalization applied before each sub-layer. Residual connections are applied around both sub-layers. The encoder layer takes the disease token sequence as input and processes it through the two sub-layers, with the multi-head self-attention mechanism capturing relationships between the token sequences. The decoder layer uses a cross-attention mechanism, taking the output of its aligned encoder layer along with the output from the previous decoder layer as input. The cross-attention mechanism, through the [CLS] token, learns the features of the disease token sequence and is used to acquire the disease category features.
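The layer pairing described above can be sketched as follows. This is a simplified single-head illustration of the data flow only (layer norm, the feedforward sub-layers, and trained weights are omitted), not the authors' implementation:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    # Scaled dot-product attention; when q comes from one sequence and
    # k, v from another, this is the cross-attention of Section 3.3.
    d = q.shape[-1]
    return softmax(q @ k.T / np.sqrt(d)) @ v

def edit_layer(patch_tokens, cls_token, W):
    # Encoder: self-attention among patch tokens only (no [CLS] mixing).
    enc = patch_tokens + attention(
        patch_tokens @ W["q"], patch_tokens @ W["k"], patch_tokens @ W["v"])
    # Decoder: the [CLS] token queries this layer's aligned encoder output.
    dec = cls_token + attention(
        cls_token @ W["q2"], enc @ W["k2"], enc @ W["v2"])
    return enc, dec  # two outputs feed the next aligned layer pair

rng = np.random.default_rng(0)
d, n, layers = 192, 196, 4           # hidden dim 192 as in Section 4.1
W = {name: rng.normal(size=(d, d)) * 0.05
     for name in ("q", "k", "v", "q2", "k2", "v2")}
patches = rng.normal(size=(n, d))
cls = rng.normal(size=(1, d))
for _ in range(layers):              # N aligned encoder-decoder layers
    patches, cls = edit_layer(patches, cls, W)
print(patches.shape, cls.shape)  # (196, 192) (1, 192)
```

The [CLS] output of the last layer would then go to the linear classification head.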
3.5. Data Acquisition and Preprocessing
Three publicly available datasets were used in the study. These datasets were collected from different environments with the aim of training the proposed method on various crops and their diseases and evaluating its performance in diverse test scenarios (see
Table 1 and
Figure 3).
A total of 54,305 images of 14 distinct plant species spread across 38 categories—12 healthy and 26 diseased—are contained in the PlantVillage collection [
39]. This dataset was created by the International Institute of Tropical Agriculture and the Penn State College of Agricultural Sciences. It is a useful tool for research and the development of computer-vision-based plant disease detection systems. The images are sourced from diverse environments, including research institutions and contributions from citizen scientists, covering a broad spectrum of plant species and disease types; examples include apple scab and black rot, potato early blight, and tomato leaf curl and leaf spot.
The FGVC8 dataset contains 18,632 high-quality, uniformly sized RGB images belonging to 12 categories. It comprises real field scenes with non-uniform backgrounds, captured at different times of day and leaf maturation stages with varying focal-length camera settings.
The EMBRAPA dataset [
40] contains 46,376 images in 93 categories, each subdivided according to specific criteria. The background was manually removed from all images prior to segmentation, and new images resulting from segmentation were created with healthy tissue comprising at least 20% of the total area to ensure a sharp contrast with diseased tissue. The majority of diseases represented in the images were associated with fungi (77%), followed by viruses (8%), pests (6%), bacteria (3%), phytotoxicity (2%), algae (2%), nutrient deficiencies (1%), and senescence (1%) [
40].
Since the images from the three datasets were collected in different environments, preprocessing is required after obtaining the leaf images. First, each dataset is divided into training and validation sets in an 8:2 ratio, as shown in
Table 1. Then, the resolution of all images is standardized to 256 × 256 pixels, followed by cropping to 224 × 224 pixels using the CenterCrop method from the torchvision package (0.19.1) to ensure uniform image dimensions. The cropped images are then normalized to reduce the impact of image quality variations on the model’s generalization performance. During the training phase, data augmentation techniques such as ColorJitter and RandAugment are employed to increase the diversity of the training data while preserving the overall image characteristics, thereby enhancing the model’s robustness and ensuring consistency and effective training across all data samples.
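The resize-and-crop step trims a 16-pixel border on each side; a minimal sketch of the center-crop geometry (illustrative arithmetic, not the torchvision source):

```python
def center_crop_box(height, width, crop_h, crop_w):
    # Compute the crop box used by a CenterCrop-style transform:
    # the crop window is centered, so equal borders are trimmed.
    top = (height - crop_h) // 2
    left = (width - crop_w) // 2
    return top, left, top + crop_h, left + crop_w

# Resizing to 256 x 256 and center-cropping to 224 x 224 keeps the
# central window and trims a 16-pixel border on every side.
print(center_crop_box(256, 256, 224, 224))  # (16, 16, 240, 240)
```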
3.6. Evaluation Metrics
Each model used in the comparison was evaluated based on standard classification metrics. In our experiments, accuracy, precision, recall, and F1-score were employed as evaluation indicators to provide a comprehensive assessment of model performance. The definitions of TP, TN, FP, and FN are outlined as follows:
True Positive (TP): The actual class of the sample is positive, and the model correctly identifies it as positive.
False Negative (FN): The actual class of the sample is positive, but the model incorrectly identifies it as negative.
False Positive (FP): The actual class of the sample is negative, but the model incorrectly identifies it as positive.
True Negative (TN): The actual class of the sample is negative, and the model correctly identifies it as negative.
Accuracy: In image classification, accuracy is a common performance metric. It is the number of samples correctly recognized by the model divided by the total number of samples. The higher the accuracy, the better the model's performance.
Precision: It indicates the percentage of samples that are truly positive classes out of the samples that are recognized as positive classes by the model. It lies between 0 and 1.
Recall: The ratio of the number of samples correctly identified as positive class by the model to the total number of positive class samples. The higher the Recall, the more positive class samples are correctly predicted by the model and the better the model performs.
F1-Score: The harmonic mean of precision and recall.
Specificity: The ratio of the number of negative-class samples correctly identified as negative by the model to the total number of negative samples.
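These metrics follow directly from the four counts; a minimal sketch for the binary case (in the multi-class setting, the per-class values are averaged):

```python
def classification_metrics(tp, fp, fn, tn):
    # Standard classification metrics from the confusion counts.
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)                              # a.k.a. sensitivity
    f1 = 2 * precision * recall / (precision + recall)   # harmonic mean
    specificity = tn / (tn + fp)
    return accuracy, precision, recall, f1, specificity

# Example: 40 true positives, 10 false positives, 20 false negatives,
# 30 true negatives.
acc, prec, rec, f1, spec = classification_metrics(40, 10, 20, 30)
print(round(acc, 3), round(prec, 3), round(rec, 3), round(f1, 3), round(spec, 3))
# 0.7 0.8 0.667 0.727 0.75
```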
In subsequent sections, we also use the confusion matrix to evaluate the performance of the model. In addition, we evaluate the interpretability of the model predictions using gradient-weighted class activation maps (Grad-CAMs) [
41].
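The core Grad-CAM computation weights each feature map by its spatially pooled gradient and keeps only the positive evidence; a minimal sketch with random stand-ins for the real activations and gradients:

```python
import numpy as np

def grad_cam(activations, gradients):
    # activations, gradients: (channels, h, w) arrays from the chosen conv
    # layer, with gradients taken w.r.t. the score of the target class.
    weights = gradients.mean(axis=(1, 2))             # alpha_k: pooled gradients
    cam = np.tensordot(weights, activations, axes=1)  # weighted sum of maps
    cam = np.maximum(cam, 0.0)                        # ReLU: keep positive evidence
    if cam.max() > 0:
        cam = cam / cam.max()                         # normalize to [0, 1]
    return cam

rng = np.random.default_rng(1)
acts = rng.normal(size=(8, 14, 14))   # stand-in feature maps
grads = rng.normal(size=(8, 14, 14))  # stand-in gradients
heatmap = grad_cam(acts, grads)
print(heatmap.shape)  # (14, 14), upsampled to image size for visualization
```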
3.7. Optimization Method and Loss Function
To avoid overfitting, we use the AdamW optimizer and label smoothing on the cross-entropy loss. The selection of the optimizer is detailed in
Section 4.2.
The AdamW algorithm is an extension of Adam that decouples weight decay from the gradient-based update, addressing the issue that L2 regularization and weight decay are not equivalent in Adam. The AdamW optimization algorithm offers several advantages. The decoupled decay term helps control the size of the weights while maintaining gradient stability. Additionally, it effectively reduces the risk of overfitting, thereby improving the model's generalization ability.
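The decoupled update can be illustrated for a single scalar parameter (a simplified sketch of one AdamW step, not the PyTorch implementation):

```python
import math

def adamw_step(w, g, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
               eps=1e-8, weight_decay=1e-4):
    # One AdamW step on a scalar parameter: the Adam moment update is
    # unchanged, but weight decay is applied directly to w (decoupled),
    # not folded into the gradient as plain L2 regularization would be.
    m = beta1 * m + (1 - beta1) * g          # first moment
    v = beta2 * v + (1 - beta2) * g * g      # second moment
    m_hat = m / (1 - beta1 ** t)             # bias correction
    v_hat = v / (1 - beta2 ** t)
    w = w - lr * m_hat / (math.sqrt(v_hat) + eps) - lr * weight_decay * w
    return w, m, v

# With zero gradient, only the decoupled decay term moves the weight:
w, m, v = adamw_step(1.0, 0.0, 0.0, 0.0, t=1, lr=0.1, weight_decay=0.01)
print(w)  # 0.999
```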
Label smoothing mitigates the overconfidence encouraged by softmax and makes the model assign some weight to low-probability classes. Unlike the standard cross-entropy loss function, label-smoothing cross-entropy involves every term in the calculation. The label-smoothing cross-entropy function is defined in Equation (15):
$$y_i^{LS} = (1 - \varepsilon)\,\delta_{i,\mathrm{class}} + \frac{\varepsilon}{n}, \qquad L = -\sum_{i=1}^{n} y_i^{LS} \log q_i,$$
where $\varepsilon$ denotes the degree of smoothing,
$n$ denotes the number of categories, and class denotes the current category.
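A minimal sketch of the label-smoothing cross-entropy for a single sample (pure Python, illustrative; in practice the PyTorch implementation is used):

```python
import math

def label_smoothing_ce(probs, target, eps, n):
    # Smoothed target distribution: (1 - eps) on the true class plus a
    # uniform eps / n spread over all n classes; the loss is the
    # cross-entropy against this softened distribution, so every predicted
    # probability contributes to the loss.
    loss = 0.0
    for i, q in enumerate(probs):
        y = (1.0 - eps) * (1.0 if i == target else 0.0) + eps / n
        loss -= y * math.log(q)
    return loss

probs = [0.7, 0.2, 0.1]
# With eps = 0 this reduces to the standard cross-entropy -log(q_class).
print(abs(label_smoothing_ce(probs, 0, 0.0, 3) - (-math.log(0.7))) < 1e-12)  # True
# With smoothing, low-probability classes also contribute to the loss.
print(label_smoothing_ce(probs, 0, 0.1, 3) > -math.log(0.7))  # True
```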
3.8. Transfer Learning
Transfer learning [42] applies knowledge gained from a model trained on one problem to a different but related task or domain [43]. In deep learning, this entails reusing weights from the layers of a pretrained network in a new model; these weights can be frozen, fine-tuned, or fully retrained for the new task. The early layers of a pretrained network capture general features, while during fine-tuning the final layers can be replaced and retrained for the target task. Although fine-tuning requires some training, it is still much faster than training from scratch and typically achieves higher accuracy than models trained entirely from scratch. This approach also enables deep neural networks to be trained with relatively little data.
In this paper, we use our own pretrained models learned on the ImageNet dataset and then transfer them to specific tasks trained on the target dataset.
4. Experimental Results and Analysis
4.1. Experimental Setup
All our experiments were performed on the AutoDL private cloud platform, using a single NVIDIA GeForce RTX 4090 GPU (NVIDIA, Santa Clara, CA, USA) with 24 GB of memory. The models were implemented in Python using the PyTorch (2.2.1) deep learning framework and the Timm library for training and evaluation. The software and hardware configurations are shown in
Table 2.
The model hyperparameters were adjusted slightly for each dataset, including learning rate, batch size, weight decay, and attention scaling. We utilized our pretrained models for transfer learning. For the PlantVillage dataset, we used a batch size of 64, a learning rate of 1 × 10−3, and attention scaling reduced by a factor of 4. For the FGVC8 dataset, the batch size was 128, with a learning rate of 5 × 10−4 and a four-fold reduction in attention scaling. For the EMBRAPA dataset, we used a batch size of 128, a learning rate of 1 × 10−3, and normal attention scaling. The weight decay was consistently set at 1 × 10−4, and the hidden dimension was 192.
4.2. Optimization Algorithm Comparison Experiment
To verify the effectiveness of the AdamW optimizer, we conducted comparison tests using our model. In these tests, we only changed the type of optimization algorithm while keeping other parameters constant. The optimization algorithms compared in the experiments were SGD, Adam, and AdamW. The results of the experiments are shown in
Table 3.
The AdamW optimization algorithm, a variant of the Adam optimizer, improves the weight decay calculation by incorporating both the adaptive gradient and momentum gradient mechanisms. Compared to the SGD optimization algorithm, AdamW demonstrates reduced sensitivity to the learning rate [
44]. As illustrated in
Figure 4, after 50 iterations of training, the training loss of the EDIT model with these three optimization algorithms stabilizes. The loss value of the model using the SGD optimization algorithm decreases slowly and exhibits poor convergence. In contrast, models employing the Adam and AdamW optimization algorithms show a more rapid decline in loss values, achieving lower and more consistent loss values. Furthermore, the model utilizing the AdamW optimization algorithm outperforms the Adam algorithm in terms of both loss reduction and convergence speed.
From the experimental results in
Table 3, it is evident that the model using the SGD optimization algorithm performs poorly on all three datasets, exhibiting lower accuracy. The models using the Adam and AdamW optimization algorithms produced similar results across the three datasets. However, the AdamW optimization algorithm consistently outperformed the Adam algorithm. These experimental results demonstrate that the AdamW optimization algorithm offers superior performance in deep learning model training compared to optimization algorithms such as SGD, primarily by enhancing the regularization method.
4.3. Performance Evaluation
We used the confusion matrix to measure the classification accuracy of the EDIT network model and assess its effectiveness in plant disease classification. The confusion matrix is a popular machine learning visualization tool that compares the predicted class labels with the actual class labels for each instance, providing an overview of the model's performance on the dataset and showing which classes the model confuses with one another. From the confusion matrix, metrics such as accuracy, precision, recall, and F1-score can be constructed, providing a thorough assessment of the model's classification performance.
Figure 5 displays the confusion matrix findings that were obtained by applying our model to the FGVC8 dataset validation.
Table 4 reports the corresponding evaluation metrics. The confusion matrices and evaluation metrics for the other two datasets (PlantVillage and EMBRAPA) are not provided due to the vast number of categories they contain. On the FGVC8 dataset, our proposed model shows reasonably good classification performance; as can be seen in
Figure 5, the classifications of Healthy, Rust, and Scab are very accurate, and frog eye leaf spot is also classified largely correctly. However, some misclassifications remain, concentrated in the remaining disease categories. Most of the misclassified samples share similar phenotypic characteristics, such as spots. Moreover, the complex leaf disease category contains many distinct disease types that are difficult to distinguish from those in other categories because of their similarities. These are major obstacles that affect all models equally, not just the one we propose.
The training and validation accuracy graphs for all three datasets along with the evaluation metrics are presented in
Figure 6 and
Table 5. The curves for the PlantVillage dataset are more convergent, while the training and validation results for the FGVC8 and EMBRAPA datasets show greater variation. Since phenotypic features are often similar across several types of diseases, samples in the FGVC8 dataset with numerous diseases are misclassified. The discrepancy in the EMBRAPA dataset may be due to a high degree of sample imbalance, with only a small number of samples for individual diseases.
4.4. Selection of Attention Scaling Factor
In our model, cross-attention is one of the central components, and the dot product is used to compute the relationship between the key and the query. In practice, the dot product is scaled by the square root of the key dimension d. In this section, we discuss the impact of this scaling on the experimental results. We introduce a scaling factor applied to this term and, keeping all other parameters constant, vary only this factor in the experiments.
The experimental results presented in
Table 6 demonstrate that the size of the scaling factor significantly influences the model's performance. As the scaling factor varies, key metrics such as accuracy, validation loss, and precision fluctuate across the different datasets.
In the PlantVillage dataset, decreasing the scaling factor leaves accuracy relatively stable, consistently ranging between 99.59% and 99.89%, indicating a high degree of stability. In contrast, the FGVC8 dataset shows a more pronounced sensitivity to the scaling factor: as it increases, accuracy progressively improves, peaking at 91.52%, and validation loss is minimized at the same setting, suggesting that this scaling factor more effectively captures the relevant data features. In the EMBRAPA dataset, the highest accuracy is also observed at that setting; however, the validation loss is lower at a smaller scaling factor. Although accuracy improves marginally, the increased validation loss indicates that a larger scaling factor complicates model optimization. Furthermore, when the scaling factor becomes too large, all three datasets experience a significant drop in accuracy, likely due to exploding or vanishing gradients.
As illustrated in
Figure 7, controlling the scaling factor to regulate gradient magnitude and ensure stability is essential. Unstable gradients result in erratic training. The experimental findings indicate that the impact of the scaling factor on model performance varies across datasets. Therefore, in practical applications, selecting the appropriate scaling factor should be guided by the specific characteristics and size of the dataset to achieve optimal results.
4.5. Performance Comparison with Other Models
We designed an experiment to evaluate the proposed model's performance against other models for recognizing plant diseases. All models use the same experimental methodology and environment and are trained and validated on the three datasets: PlantVillage, EMBRAPA, and FGVC8.
Four of the compared models are CNNs, and we also evaluate a vision transformer model. First, the weights of each model are initialized via transfer learning, using frozen weights pretrained on the ImageNet dataset.
Table 7 gives the mentioned performance metrics comparing the quantitative performance results of all five models and our proposed model on three different datasets.
As shown in
Figure 8, in the PlantVillage dataset our proposed model achieved the highest accuracy of 99.89%, significantly outperforming classical models such as ResNet, ViT-S, and InceptionV3 and demonstrating its strong generalization ability on large datasets. Our model also recorded the lowest validation loss on this dataset, indicating a better fit to the data. As shown in
Figure 9, performance varied considerably across models on the FGVC8 dataset. Our model achieved the highest accuracy of 91.52%, outperforming the other models. In contrast, ViT-S performed poorly on this dataset, showing weak fitting capability and suggesting a limited ability to handle real-world background data. As shown in
Figure 10, in the EMBRAPA dataset our proposed model achieved an accuracy of 97.42%, significantly surpassing the other models. Comparing accuracy, validation loss, precision, recall, F1-score, and specificity across models, our model consistently delivered superior classification results, demonstrating its efficiency in identifying plant diseases across various scenarios.
According to the performance comparison with the other models, our proposed model extracts the distinguishing features of each disease more effectively.
As shown in Figure 11, Grad-CAM visualizations further illustrate these cases, using samples from the validation set of each of the three datasets. Notably, on the PlantVillage dataset most models correctly localized the full diseased leaf with considerable accuracy, whereas the ViT-S model could hardly identify the correct disease area on any of the three datasets. In addition, the SEnet and InceptionV3 models localized the diseased portion of the leaf very well on the EMBRAPA dataset.
As shown in Figure 12, although the MobilenetV3 model's recognition accuracy is not the best, its parameter count and GFLOPs are very small. Comparing all of the models, it is evident that the proposed model locates diseased plant leaf areas clearly and progressively improves its capture of disease traits within the targeted region. We also find that the proposed model recognizes plant leaf disease regions well against both laboratory backgrounds and real-world scenes, further illustrating its practical feasibility.
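The Grad-CAM heatmaps discussed above follow a simple recipe: global-average-pool the gradient of the class score over each feature map to get a channel weight, take the weighted sum of the feature maps, and apply ReLU. The sketch below is a minimal pure-Python illustration with tiny hypothetical activations and gradients, not the actual implementation used in the experiments:

```python
def grad_cam(activations, gradients):
    """Grad-CAM over one conv layer: weight each feature map by the
    spatial mean of its gradient, sum over channels, then ReLU.
    activations, gradients: [K][H][W] nested lists."""
    K = len(activations)
    H, W = len(activations[0]), len(activations[0][0])
    # alpha_k: channel importance = mean gradient over spatial positions
    alphas = [sum(sum(row) for row in gradients[k]) / (H * W) for k in range(K)]
    cam = [[0.0] * W for _ in range(H)]
    for k in range(K):
        for i in range(H):
            for j in range(W):
                cam[i][j] += alphas[k] * activations[k][i][j]
    # ReLU keeps only features with a positive influence on the class
    return [[max(0.0, v) for v in row] for row in cam]

# Toy example: 2 channels with 2x2 spatial maps
acts = [[[1.0, 0.0], [0.0, 1.0]],
        [[0.0, 2.0], [2.0, 0.0]]]
grads = [[[0.4, 0.4], [0.4, 0.4]],        # alpha_0 = 0.4
         [[-0.2, -0.2], [-0.2, -0.2]]]    # alpha_1 = -0.2
heatmap = grad_cam(acts, grads)
```

In practice the resulting low-resolution map is upsampled to the input image size and overlaid on the leaf image, which is how the visualizations in Figure 11 are produced.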
5. Discussion
The proposed model aims to advance AI-based solutions for plant disease identification and monitoring. Fast and accurate models are essential for timely disease detection and intervention, addressing critical food security issues. However, due to the high similarity among different diseases, even human experts struggle with precise identification, making lightweight models inadequate for capturing intricate disease details and providing accurate predictions. Although increasing model depth can enhance feature extraction, it introduces challenges such as gradient vanishing, overfitting, and the growing computational burden associated with deeper architectures. Additionally, these deeper models, while powerful, are not lightweight enough, making them difficult to deploy on resource-constrained devices such as mobile platforms.
Training deep models from scratch also incurs significant computational costs. Transfer learning [
42] offers a viable solution to these issues by improving accuracy while reducing training time. Fine-tuned state-of-the-art deep learning models were analyzed alongside our proposed model for plant disease identification and compared in terms of GFLOPs, parameter size, and accuracy. The data in
Table 5 demonstrate that the deep network model ResNet50 outperforms the shallower ResNet18, but this improvement comes at the cost of increased model size and computational complexity. Similarly, our proposed transformer model achieves better results than the CNN model, yet it is not as lightweight as desired for real-time applications or deployment in low-resource environments.
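The parameter and GFLOPs figures underlying such comparisons follow directly from layer shapes. As a minimal sketch, the standard formulas for a single convolutional layer are shown below, with hypothetical layer sizes chosen only for illustration:

```python
def conv2d_cost(c_in, c_out, k, h_out, w_out, bias=True):
    """Parameter count and multiply-accumulate count of one k x k
    Conv2d layer mapping c_in channels to c_out channels."""
    params = c_out * (c_in * k * k + (1 if bias else 0))
    # one multiply-accumulate per kernel element per output position
    macs = c_out * c_in * k * k * h_out * w_out
    return params, macs

# e.g. a 3x3 conv, 64 -> 128 channels, on a 56x56 output feature map
params, macs = conv2d_cost(64, 128, 3, 56, 56)
```

Summing these quantities over every layer of a network yields the total parameter size and (after scaling by 1e9) the GFLOPs values reported in comparisons like Table 5, which is why deeper or wider models trade accuracy gains for deployment cost.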
Despite the use of transfer learning, the model still requires a large number of images for training to produce reliable results [
45]. However, the exact number of images needed remains undefined, and this remains a key challenge. Furthermore, each plant can exhibit multiple diseases, making accurate image labeling difficult. The presence of rare diseases further increases the risk of misrecognition.
Moreover, the diversity of image capture environments across different datasets introduces additional limitations in model development. A significant proportion of datasets focus on single-leaf images, whereas real-life scenarios often involve multiple diseased leaves with complex backgrounds. This discrepancy can impair the model’s performance in practical settings. The datasets used in this study include instances of the same plant with multiple diseases as well as multiple plants with a single disease, adding complexity to the training and evaluation process.
In this paper, we utilized three datasets and achieved the best results, demonstrating that the proposed model is more suitable than existing models for plant disease recognition applications in precision agriculture. However, challenges remain in plant disease recognition and detection. For instance, the high degree of similarity in the appearance and texture of different plant diseases, along with the subtlety of initial disease symptoms on plant leaves, poses significant obstacles to the development of effective AI solutions.
6. Conclusions
Plant diseases represent a significant challenge to global agricultural development, with the potential to cause devastating losses of food crops in severe cases. The automatic diagnosis of plant diseases in the context of agricultural informatization is therefore urgently needed. Plant disease identification using traditional approaches is frequently expensive and time-consuming, so the application of artificial intelligence is of paramount importance for early detection. This study examines the potential of deep learning and transfer learning in addressing the challenge of plant disease recognition and proposes a novel deep learning architecture. Detailed analysis of performance metrics such as accuracy, validation loss, precision, recall, F1-score, and specificity shows that the model delivers outstanding performance on three publicly available datasets with different scales and background conditions. The model's accuracy on the FGVC8, PlantVillage, and EMBRAPA datasets reached 91.5%, 99.9%, and 97.4%, respectively, and it outperformed six advanced deep learning models in plant disease recognition. Furthermore, the interpretability of the model's predictions was evaluated using the Grad-CAM method, which showed that the model's predictions are well interpretable.
Despite the excellent performance of the EDIT model in feature extraction and classification, it has not fully met expectations in addressing the challenges posed by complex backgrounds and uneven data distribution. In future research, we will focus on overcoming dataset-related challenges and aim to develop a real-time object recognition model. This model will be capable of effectively identifying targets in both simple and complex background scenarios, including real-world field conditions. Additionally, we plan to take a lightweight approach, deploying the optimized model in mobile applications to support more plant disease researchers and farmers. Ultimately, we aim to enhance agricultural productivity and promote increased food production.