Article

Diagnosis of Custard Apple Disease Based on Adaptive Information Entropy Data Augmentation and Multiscale Region Aggregation Interactive Visual Transformers

1 Jiyang College, Zhejiang A&F University, Shaoxing 311800, China
2 College of Computer Science, Sichuan University, Chengdu 610065, China
3 Agricultural Information Institute of CAAS, National Agriculture Science Data Center, Beijing 100081, China
4 School of Information Technology, Deakin University, Melbourne Burwood Campus, 221 Burwood Hwy, Burwood, VIC 3125, Australia
5 Department of Information Technology, Satya Wacana Christian University, 52-60 Diponegoro Rd, Salatiga City 50711, Indonesia
* Author to whom correspondence should be addressed.
Agronomy 2024, 14(11), 2605; https://doi.org/10.3390/agronomy14112605
Submission received: 1 October 2024 / Revised: 27 October 2024 / Accepted: 1 November 2024 / Published: 4 November 2024
(This article belongs to the Special Issue AI, Sensors and Robotics for Smart Agriculture—2nd Edition)

Abstract

Accurate diagnosis of plant diseases is crucial for crop health. This study introduces the EDA–ViT model, a Vision Transformer (ViT)-based approach that integrates adaptive entropy-based data augmentation for diagnosing custard apple (Annona squamosa) diseases. Traditional models such as convolutional neural networks (CNNs) and the standard ViT face challenges with local feature extraction and large dataset requirements. EDA–ViT overcomes these challenges by using multi-scale weighted feature aggregation and a feature interaction module, enhancing both local and global feature extraction. The adaptive data augmentation method refines the training process, boosting accuracy and robustness. With a dataset of 8226 images, EDA–ViT achieved a classification accuracy of 96.58%, an F1 score of 96.10%, and a Matthews Correlation Coefficient (MCC) of 92.24%, outperforming other models. The inclusion of the Deformable Multi-head Self-Attention (DMSA) mechanism further enhanced feature capture. Ablation studies revealed that the adaptive augmentation contributed a 0.56% accuracy improvement and a 0.34% increase in MCC. In summary, EDA–ViT presents an innovative solution for custard apple disease diagnosis, with potential applications in broader agricultural disease detection, ultimately aiding precision agriculture and crop health management.

1. Introduction

Annona squamosa, also known as sugar apple, sweet apple, and cream apple, is a tropical endemic species and one of the most important fruits in the tropical regions of China, offering a unique flavor, rich nutritional value, and broad market prospects [1]. Fruit cultivation is a cornerstone of global agriculture, contributing significantly to both food security and the economy [2]. However, the productivity and quality of fruits are under constant threat from a wide range of plant diseases. Early identification and proper management of these diseases are critical for reducing crop losses and ensuring food supply stability. Traditional methods of plant disease detection, which rely heavily on expert knowledge, are time-consuming and prone to human error [3]. As a result, there has been a shift toward automated systems that utilize artificial intelligence (AI) and machine learning (ML) to improve accuracy and efficiency in disease identification [4].
The rise of deep learning models, particularly convolutional neural networks (CNNs), has transformed the field of plant disease detection by enabling precise image-based classification [5,6,7]. Deep learning has achieved better performance than traditional machine-learning feature engineering [8,9], and while AutoML has become an attractive alternative to manual ML practice, features extracted from expert knowledge and experience still cannot match those learned by CNNs [10,11]. Several hybrid approaches have been proposed to optimize the detection process. For instance, Sharma et al. [12] introduced DLMC-Net, a lightweight multi-class classification model designed for real-time leaf disease detection, reducing computational complexity without sacrificing accuracy. Similarly, Zhang et al. [13] proposed a hybrid attention network (HaNet) that integrates frequency–domain attention with channel attention to improve disease identification in citrus orchards, achieving remarkable accuracy despite challenging environmental conditions. More recently, Vision Transformers (ViTs) have emerged as a powerful alternative to CNNs, offering superior performance by capturing global relationships in images through self-attention mechanisms. However, ViT models often struggle with local feature extraction, which is crucial for identifying the fine details associated with certain plant diseases. Hemalatha and Jayachandran [14] proposed a multi-task learning-based Vision Transformer (PDLC–ViT) for plant disease localization and classification. The model combines co-scaling, co-attention, and cross-attention mechanisms within a multi-task learning framework; trained and evaluated on the Plant Village dataset, with key hyperparameters tuned by grid search, it reached an accuracy of 99.97%. Rezaei et al. [15] demonstrated that ViT significantly improved performance on complex samples, particularly by mitigating the influence of complex backgrounds through feature attention mechanisms. Zeng et al. [16] tackled the challenge of similar disease images from different crops infected by the same pathogen, proposing a large-scale, fine-grained disease classification model (SEViT). This model achieved a classification accuracy of 88.34% on a dataset comprising 38 vegetable diseases. Sharma et al. [17] introduced a novel Swin model with random window shifts to enhance the efficiency of crop leaf disease classification. They highlighted the limitations of ViT, such as its reliance on long-range feature dependencies, its inefficiency in capturing local features, and its requirement for large datasets to converge effectively. Additionally, Sandhya Devi et al. [18] identified that data augmentation techniques could result in model performance divergence and instability, noting that these techniques are primarily applicable in large-scale reinforcement learning experiments. This raises challenges regarding the use of data augmentation methods and the adequacy of dataset sizes.
To address these challenges, various data augmentation (DA) methods are widely employed to enhance the generalization capability of deep neural networks. By generating synthetic images, the quality of training data is improved, thereby increasing model robustness and performance. Data augmentation is particularly effective in situations where real-world datasets are limited or imbalanced, especially when it comes to the impact of large, effective data samples on the performance of ViT models. While the identification of individual plant diseases has been effectively resolved in most studies, the absence of diverse, high-quality datasets that cover multiple plant disease categories—particularly those with cross-species links—continues to impede the development of robust classification models.
This study presents a novel method, EDA–ViT, which integrates an adaptive entropy-based data augmentation (EDA) technique, requiring no manual adjustments, with a ViT classification model based on multi-scale weighted feature aggregation and feature exchange. The backbone is designed using a resolution-progressive approach, consisting of multi-scale and interaction modules, with feature transmission handled by a feedforward network (FFN). Each component independently and flexibly captures interactive, single-scale, and multi-scale information. During training, an adaptive entropy-based data augmentation method is introduced, which autonomously adjusts the execution of each augmentation technique based on classification performance feedback, ensuring non-intrusive coordination throughout the training process. Furthermore, the EDA–ViT model, equipped with a self-attention mechanism, is applied to Custard Apple disease classification. The enhanced multi-head attention mechanism ensures the effective capture of both global and local features. The contributions of this paper are summarized as follows:
Proposed an Adaptive Entropy-Driven Vision Transformer Model (EDA–ViT): This paper introduces the EDA–ViT model, combining adaptive entropy-based data augmentation with a multi-scale feature aggregation mechanism to enhance the accuracy and robustness of custard apple disease diagnosis.
Enhanced Multi-Scale and Global-Local Feature Interaction: The model integrates a multi-scale module and an interaction module, effectively capturing both global context and local features, improving the model’s ability to identify complex disease patterns in leaves and fruits.
Introduced Adaptive Entropy-Based Data Augmentation: The paper presents an innovative entropy-driven data augmentation technique that dynamically adjusts augmentation based on classification difficulty, significantly improving the model’s generalization ability and handling of imbalanced datasets.
Improved Multi-Head Self-Attention Mechanism: An enhanced multi-head self-attention mechanism (DMSA) is embedded into the interaction module, refining spatial position adjustments and feature weighting for more accurate disease identification.
Achieved Superior Performance in Disease Classification: The EDA–ViT model achieved a classification accuracy of 96.58% and an F1 score of 96.10% on the custard apple disease dataset, outperforming existing deep learning models and demonstrating its effectiveness for practical application.
The remainder of this paper is structured as follows: The next section delves into the literature review, summarizing recent advancements and challenges in deep learning applications for agricultural disease detection. Subsequently, the third section introduces the proposed EDA–ViT model, detailing its architecture, including the multi-scale module, interaction module, and adaptive data augmentation approach. The fourth section presents the experimental results and discussions, including dataset descriptions, training configurations, and a comprehensive performance analysis of the EDA–ViT model compared to other advanced models. Finally, the paper concludes with insights and potential implications of the EDA–ViT model for improving disease diagnosis in custard apple cultivation, followed by suggestions for future research directions.

2. Literature Review

The application of deep learning models in agricultural disease detection has gained significant attention in recent years [19], particularly in the classification and identification of citrus fruit diseases. These models have demonstrated exceptional performance in detecting various diseases and improving agricultural productivity.
Uğuz et al. [20] proposed a CNN-based model called CitrusNet for the detection and classification of physical disorders and diseases in citrus fruits, achieving an impressive classification accuracy of 99%. Their study compared various CNN models, such as YOLOv5 and Mask R-CNN, with YOLOv5 achieving the highest accuracy in disease detection with an average precision of 0.99. Similarly, Syed-Ab-Rahman et al. [21] presented a deep learning model based on Faster R-CNN, designed to classify citrus diseases such as citrus black spot (CBS), citrus bacterial canker (CBC), and Huanglongbing (HLB). Their model achieved an overall detection accuracy of 94.37%, surpassing other models in both efficiency and accuracy.
Building on the trend of deep learning models for disease detection, Zhang et al. [22] introduced an approach that combined YOLO-V4 and EfficientNet for real-time disease identification in citrus orchards. Their method achieved an F1-score of 95.3% in disease detection and demonstrated robust performance in complex field conditions. Furthering the advancement in citrus disease recognition, Hassam et al. [23] utilized a modified MobileNet V2 with the Improved Whale Optimization Algorithm (IWOA) to optimize feature extraction, reaching an accuracy of 99.7% on a hybrid dataset. This method significantly reduced computational time, making it a promising solution for real-time disease detection.
In terms of disease severity classification, Dhiman et al. [24] presented a novel VGGNet-based deep learning model, which focused on identifying different severity levels in citrus diseases. The model achieved a high accuracy of 99% for low severity and 98% for high severity, demonstrating the potential for efficient real-time monitoring of disease progression. Beyond citrus fruits, Chen and Wu [25] applied deep learning techniques to grape leaf disease identification, employing a three-stage pipeline that included data augmentation using DCGAN. Their approach achieved superior performance with an identification accuracy of 97.5% despite limited data availability. Giakoumoglou et al. [26] also explored data scarcity by introducing the Generate-Paste-Blend-Detect method, which created synthetic datasets to improve object detection accuracy in agriculture, achieving a mean average precision of 0.661 using YOLOv8.
Finally, Zhang et al. [27] introduced DP-CycleGAN to generate synthetic maize leaf disease images, improving image quality and disease recognition performance under limited data conditions. Their work demonstrated a substantial improvement in generating realistic synthetic data, essential for training robust disease recognition models. Xu et al. [28] proposed a single-image-based specular highlight removal method focused on eliminating highlights from grayscale images. This approach leverages a generative adversarial network (GAN), where the generator produces images without specular highlights, and the discriminator evaluates whether the generator’s output is clear and free of highlights. This framework ensures that more details are preserved in the output images. Dai et al. [29] used a two-stage weather data augmentation technique to improve the generalization of a deep model in three real agricultural scenarios and suppress overfitting, training and testing their multilevel deep information feature fusion extraction network, DFN-PSAN, on three plant disease datasets with an average accuracy and F1 score above 95.27%. Dai et al. [30] presented a deep learning-based plant disease recognition model (PPLC-Net) that integrates dilated convolution, multilevel attention mechanisms, and global average pooling (GAP) layers. The model employs an innovative weather data augmentation method to expand the sample size, enhancing the generalization ability and robustness of feature extraction, and achieves a recognition accuracy of 99.702% on the held-out test set. Similarly, Barman et al. [31] applied ViT to tomato disease detection, achieving an accuracy of 90.99%. The integration of this model into a smartphone app highlights the growing potential of mobile-based solutions for real-time agricultural disease management.
These studies collectively demonstrate the ongoing advancements in deep learning applications for agricultural disease detection. The integration of CNNs, GANs, DAs, and transformer-based models has led to significant improvements in disease recognition accuracy, computational efficiency, and real-time applicability, offering valuable tools for precision agriculture.

3. Materials and Methods

3.1. EDA–ViT

The proposed EDA–ViT architecture, as illustrated in Figure 1, consists of four modules with different resolutions, following a hierarchical design similar to ResNet. It primarily comprises (a) a multi-scale module, (b) an interaction module, and (c) a feedforward network (FFN). The model begins with a convolutional block for coarse data embedding and ends by directly connecting to a classification block. Throughout the process, blocks formed by the multi-scale and interaction modules progressively refine features, which are then passed to the next block via the FFN. Downsampling is performed by the multi-scale module, incorporating a weighted fusion mechanism to combine multiple operations. All three modules employ a residual connection in the form of y = f(x) + x to prevent performance degradation.
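To make the resolution-progressive, block-based organization concrete, the following PyTorch sketch outlines one possible skeleton of the backbone just described. The stage widths and depths, the strided-convolution downsampling between stages, and the simplified placeholder sub-modules are illustrative assumptions rather than the published implementation; the actual multi-scale and interaction modules are detailed in Sections 3.2 and 3.3.

import torch
import torch.nn as nn

class PlaceholderModule(nn.Module):
    # Stand-in for the multi-scale and interaction modules of Sections 3.2 and 3.3:
    # a single conv with the paper's y = f(x) + x residual connection.
    def __init__(self, dim):
        super().__init__()
        self.f = nn.Sequential(nn.Conv2d(dim, dim, 3, padding=1), nn.GELU())
    def forward(self, x):
        return self.f(x) + x

class EDAViTBlock(nn.Module):
    # One block: (a) multi-scale module, (b) interaction module, (c) FFN, each residual.
    def __init__(self, dim):
        super().__init__()
        self.multi_scale = PlaceholderModule(dim)
        self.interaction = PlaceholderModule(dim)
        self.ffn = nn.Sequential(nn.Conv2d(dim, dim * 4, 1), nn.GELU(),
                                 nn.Conv2d(dim * 4, dim, 1))
    def forward(self, x):
        x = self.multi_scale(x)
        x = self.interaction(x)
        return x + self.ffn(x)

class EDAViT(nn.Module):
    # Hierarchical backbone: conv stem for coarse embedding, four stages of blocks with
    # downsampling between stages, and a classification head. Widths/depths are illustrative.
    def __init__(self, num_classes=6, dims=(64, 128, 256, 512), depths=(2, 2, 4, 2)):
        super().__init__()
        self.stem = nn.Conv2d(3, dims[0], kernel_size=4, stride=4)
        layers = []
        for i, (dim, depth) in enumerate(zip(dims, depths)):
            if i > 0:
                layers.append(nn.Conv2d(dims[i - 1], dim, kernel_size=2, stride=2))
            layers.extend(EDAViTBlock(dim) for _ in range(depth))
        self.stages = nn.Sequential(*layers)
        self.head = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                                  nn.Linear(dims[-1], num_classes))
    def forward(self, x):
        return self.head(self.stages(self.stem(x)))

logits = EDAViT()(torch.randn(1, 3, 256, 256))  # output shape (1, 6)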

3.2. Multiscale Module

Inspired by the convolutional feature aggregation (FA) [32], the multi-scale module is designed to capture information at different scales, as detailed in Figure 1a. It includes multiple convolutional branches ($\mathrm{Conv}_{S_i}$, $i = 1, \dots, n$) operating at varying scales. By linking information from different receptive fields, these operations effectively provide inductive bias, reducing the computational burden introduced by the position embedding in ViT. Specifically, multiple convolutions are applied to transform the input feature map $X$, resulting in the final fused output $Z$. A residual connection is also applied to the output, which can be represented as follows:
$X_{\mathrm{norm}} = \mathrm{Norm}(X)$
$a_i = \mathrm{Conv}_{S_i}(X_{\mathrm{norm}}), \quad i = 1, \dots, n$
$a_{\mathrm{fused}} = \mathrm{Fusion}(a_1, a_2, \dots, a_n)$
$Z = \mathrm{Conv}_{S_0}(a_{\mathrm{fused}})$
$\mathrm{Output} = Z + X$
In these equations, $X_{\mathrm{norm}}$ represents the normalized information passed from the beginning of the EDA–ViT model, and $\mathrm{Fusion}(\cdot)$ denotes the fusion method. The introduced Weighted Operation Mixing (WOM) mechanism combines all operations $a_i$ [33]. This mechanism applies the SoftMax function to a set of learnable weights $\alpha_n$ and uses the branch outputs $a_1, \dots, a_n$ (denoted $o_n(x)$ in the formula below) as the operations being mixed. The intermediate representation $x_o$ can be obtained as follows:
$x_o = \sum_{n=1}^{N} \frac{\exp(\alpha_n)}{\sum_{n'=1}^{N} \exp(\alpha_{n'})} \, o_n(x)$
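The following sketch illustrates how the parallel convolutional branches and the Weighted Operation Mixing fusion above could be realized in PyTorch; the kernel sizes, the use of GroupNorm for the normalization step, and the 1 × 1 projection standing in for $\mathrm{Conv}_{S_0}$ are assumptions made for illustration.

import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiScaleModule(nn.Module):
    # Parallel convolutions at different scales (Conv_{S_i}), fused by Weighted Operation
    # Mixing (softmax over learnable weights alpha), projected by a 1x1 convolution
    # (Conv_{S_0}) and added back to the input as a residual.
    def __init__(self, dim, kernel_sizes=(1, 3, 5)):
        super().__init__()
        self.norm = nn.GroupNorm(1, dim)                                      # Norm(X)
        self.branches = nn.ModuleList(
            [nn.Conv2d(dim, dim, k, padding=k // 2) for k in kernel_sizes])   # Conv_{S_i}
        self.alpha = nn.Parameter(torch.zeros(len(kernel_sizes)))             # WOM weights
        self.proj = nn.Conv2d(dim, dim, 1)                                    # Conv_{S_0}

    def forward(self, x):
        x_norm = self.norm(x)
        outs = torch.stack([branch(x_norm) for branch in self.branches])      # a_1 ... a_n
        weights = F.softmax(self.alpha, dim=0).view(-1, 1, 1, 1, 1)           # softmax(alpha_n)
        a_fused = (weights * outs).sum(dim=0)                                 # weighted mixing
        return self.proj(a_fused) + x                                         # Output = Z + X

features = MultiScaleModule(64)(torch.randn(1, 64, 32, 32))  # shape preserved: (1, 64, 32, 32)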

3.3. Interaction Module

The relationship between global scene context and local visual information has been a key focus of numerous studies [34]. While ViTs effectively capture global information and long-range dependencies in images, they lack the ability to extract local features. To address this limitation, this paper proposes an interaction module to facilitate global-local interaction, centered on a strong representation learning mechanism for key targets. As shown in Figure 2a, this module consists of a local feature extraction branch (local path) and a global feature extraction branch (global path). First, the input $X$ is processed through convolution operations and the ReLU activation function, producing the local path representation $L(X)$, which captures the local features of the input image. For global features, the normalized input $X$ is passed through the DMSA module, resulting in the global path representation $G(X)$. This global path, relying primarily on the self-attention mechanism, captures long-range dependencies and global features, as expressed in the following equations:
$L(X) = \mathrm{Norm}(\mathrm{Conv}(\mathrm{ReLU}(\mathrm{Conv}(X))))$
$G(X) = \mathrm{DMSA}(\mathrm{Norm}(X))$
The local and global features from the two branches are fused in a weighted manner to obtain $F(X)$, where $a_l$ and $a_g$ are learnable parameters that control the weights of the local and global features. The fused features are further integrated through a convolution operation. Finally, the convolved features are directly added to the input via a residual connection, producing the final output.
$F(X) = a_l L(X) + a_g G(X)$
$Y = \mathrm{Conv}_{S_0}(F(X))$
$\mathrm{Output} = Y + X$
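A minimal sketch of this global-local interaction is given below, assuming illustrative layer sizes and using standard multi-head self-attention as a stand-in for DMSA, which is sketched separately later in this subsection.

import torch
import torch.nn as nn

class InteractionModule(nn.Module):
    # Local path: Conv -> ReLU -> Conv -> Norm; global path: self-attention over the
    # normalized input; the paths are fused with learnable weights a_l and a_g,
    # projected by a 1x1 convolution, and added back to the input as a residual.
    def __init__(self, dim, num_heads=4):
        super().__init__()
        self.local = nn.Sequential(nn.Conv2d(dim, dim, 3, padding=1), nn.ReLU(),
                                   nn.Conv2d(dim, dim, 3, padding=1), nn.GroupNorm(1, dim))
        self.norm = nn.GroupNorm(1, dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)  # stand-in for DMSA
        self.a_l = nn.Parameter(torch.tensor(1.0))   # weight of the local path
        self.a_g = nn.Parameter(torch.tensor(1.0))   # weight of the global path
        self.proj = nn.Conv2d(dim, dim, 1)           # Conv_{S_0}

    def forward(self, x):
        b, c, h, w = x.shape
        local = self.local(x)                                   # L(X)
        tokens = self.norm(x).flatten(2).transpose(1, 2)        # (B, H*W, C)
        attended, _ = self.attn(tokens, tokens, tokens)         # G(X) via self-attention
        global_feat = attended.transpose(1, 2).reshape(b, c, h, w)
        fused = self.a_l * local + self.a_g * global_feat       # F(X) = a_l L(X) + a_g G(X)
        return self.proj(fused) + x                             # Output = Y + X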
The global path is further enhanced by the DMSA mechanism. The introduction of channel attention and spatial attention mechanisms has successively improved the performance of deep neural networks [35]. As ViTs evolve across various domains, multi-head self-attention (MSA) has emerged as a powerful tool for handling sequential data. It constructs better global dependency models by applying self-attention to low-resolution feature maps, making it particularly effective for inputs with spatial deformations or irregular distributions. DMSA enhances the naive MSA [36], as illustrated in the dashed box of Figure 2b. In naive MSA, each head processes the input features through the function $f_{qkv}$, generating the query ($Q$), key ($K$), and value ($V$) matrices, as shown in the equation below. The function $f_{qkv}$ comprises the concatenated operations $f_q$, $f_k$, and $f_v$, with no spatial adjustment or reweighting; it directly extracts the $Q$, $K$, and $V$ features from the input feature map, and multi-head self-attention follows the same approach. DMSA refines the query vector $Q$ by incorporating a query-aware access mechanism, which fine-tunes spatial positions to improve feature extraction. The key improvements are twofold: the yellow matrix applies spatial fine-tuning to each location, represented by the deformable offset $\Delta l$, allowing the model to better capture spatial variations in the input feature map; the blue matrix, denoted as the modulation scalar $\Delta m$, reweights features at each location to ensure appropriate handling of spatially diverse inputs.
$Q, K, V = f_{qkv}(X)$
In DMSA, when the input feature map $X$ contains $L$ positions, the offset $\Delta l$ and modulation scalar $\Delta m$ are first predicted using the query matrix $Q = f_q(X)$. These parameters are then employed to resample the feature map $X_l$, with spatial adjustment applied through the bilinear interpolation function $S(X_l, \Delta l)$ to fine-tune the positional information.
$X'_l = S(X_l, \Delta l) \cdot \Delta m$
In this context, $\Delta l$ represents the positional offset, while $\Delta m$ serves as a weighting factor within the range $(0, 1)$. After these adjustments, the modified feature map $X'_l$ is used to compute the updated key and value matrices, $K$ and $V$.
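The deformable resampling step can be sketched as follows; predicting a single shared offset and modulation scalar per location (rather than per head or per reference point) and the specific layer shapes are simplifying assumptions, with the bilinear resampling implemented via grid_sample.

import torch
import torch.nn as nn
import torch.nn.functional as F

class DMSA(nn.Module):
    # Simplified sketch of deformable multi-head self-attention: offsets (Δl) and a
    # modulation scalar (Δm) are predicted from the query features, the feature map is
    # resampled by bilinear interpolation at the offset positions, and K and V are
    # computed from the modulated, resampled map.
    def __init__(self, dim, num_heads=4):
        super().__init__()
        self.to_q = nn.Conv2d(dim, dim, 1)                   # f_q
        self.offset = nn.Conv2d(dim, 2, 3, padding=1)        # Δl: (dx, dy) per location
        self.modulation = nn.Conv2d(dim, 1, 3, padding=1)    # Δm: scalar per location
        self.to_kv = nn.Conv2d(dim, 2 * dim, 1)              # f_k and f_v
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x):
        b, c, h, w = x.shape
        q_map = self.to_q(x)
        dl = self.offset(q_map).permute(0, 2, 3, 1)           # (B, H, W, 2), offsets in pixels
        dm = torch.sigmoid(self.modulation(q_map))            # (B, 1, H, W), values in (0, 1)

        # Sampling grid: base pixel positions plus offsets, normalized to [-1, 1] for grid_sample.
        ys, xs = torch.meshgrid(torch.arange(h, device=x.device, dtype=x.dtype),
                                torch.arange(w, device=x.device, dtype=x.dtype), indexing="ij")
        gx = xs.unsqueeze(0) + dl[..., 0]
        gy = ys.unsqueeze(0) + dl[..., 1]
        grid = torch.stack((2 * gx / max(w - 1, 1) - 1, 2 * gy / max(h - 1, 1) - 1), dim=-1)
        x_resampled = F.grid_sample(x, grid, align_corners=True) * dm   # S(X, Δl) · Δm

        k, v = self.to_kv(x_resampled).chunk(2, dim=1)
        flat = lambda t: t.flatten(2).transpose(1, 2)          # (B, H*W, C) token sequences
        out, _ = self.attn(flat(q_map), flat(k), flat(v))
        return out.transpose(1, 2).reshape(b, c, h, w)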

3.4. Adaptive Data Augmentation

The sample size and class imbalance in a dataset are critical factors influencing model performance. Traditional data augmentation is divided into offline and online approaches, with online augmentation transformations applied dynamically during training to avoid the need for generating large datasets beforehand [37]. Data augmentation methods can be categorized into three types. The first category involves image erasure techniques, such as Cutout, GridMask, and Random Erasing, which randomly mask certain regions of the image. However, these approaches may overlook important structural features. The second category involves image-mixing techniques, including Mixup and CutMix, which combine information from multiple images to create new training samples, though the randomness introduced may result in distribution shifts and noise. The third category is automatic data augmentation, with methods such as RandAugment and TrivialAugment leveraging reinforcement learning or various strategies to identify optimal augmentation combinations [38]. To address these challenges, a method is proposed that dynamically adjusts augmentation magnitude based on entropy feedback, adapting to both image characteristics and the model’s training process. Compared to search spaces based on genetic algorithms (GA) [39], the Entropy-based Adaptive Data Augmentation (EDA) method offers a non-intrusive and lightweight solution [40]. The mathematical formulation is as follows:
$\mathrm{mag}(x) = 1 + \frac{1}{\log k} \sum_{i=1}^{k} g(x)_i \log g(x)_i$
Information entropy is defined as $H(x) = -\sum_{i=1}^{k} g(x)_i \log g(x)_i$, where higher entropy indicates greater classification difficulty, and lower entropy suggests easier classification. Here, $g(x)$ represents a probability distribution that satisfies $\sum_{i=1}^{k} g(x)_i = 1$. This distribution is produced by the Softmax function, i.e., $g(x) = \mathrm{softmax}(f_\theta(x)) \in \mathbb{R}^k$, where $f_\theta$ denotes the classification model and the input sample $x \in \mathbb{R}^n$ yields the model output $f_\theta(x)$. During data augmentation, the magnitude of the sub-methods is dynamically adjusted based on the sample's entropy value. As $\mathrm{mag}(x) \to 1$, the augmentation becomes more diverse; when $\mathrm{mag}(x) \to 0$, the augmentation changes become minimal. This dynamic adjustment continues throughout the training process to ensure adaptive augmentation. Additionally, combined loss functions perform effectively when new dynamic modules or variables are introduced [41]. In this study, information entropy is combined with cross-entropy loss (CEloss), and the corresponding formula is as follows:
$\mathrm{EntLoss}(x) = -\frac{1}{\log k} \sum_{i=1}^{k} g(x)_i \log g(x)_i$
Similarly, entropy regularization loss (EntLoss) is introduced to encourage the model to classify with higher confidence while reducing output entropy, thereby enhancing overall model performance.
The selective application of augmentation techniques, such as random cropping or flipping, ensures minimal augmentation for simpler images while applying more extensive transformations to complex images. This dynamic adjustment optimizes feature extraction by preserving key information in less complex images and enriching feature diversity in more complex ones. As a result, the approach enhances the model’s overall robustness and accuracy by tailoring augmentations to the visual characteristics of each input. Below is the detailed Algorithm 1 with the corresponding mathematical formulation.
Algorithm 1 Entropy-Based Adaptive Data Augmentation (EDA)
Require: Training dataset $D = \{(x_i, y_i)\}_{i=1}^{N}$, model $f_\theta$, augmentation space $A$, number of classes $k$
Ensure: Augmented dataset $D'$
1:  for each batch $B \subset D$ do
2:    for each $(x, y) \in B$ do
3:      Step 1: Compute the softmax probability distribution:
          $g(x)_i = \frac{\exp(f_\theta(x)_i)}{\sum_{j=1}^{k} \exp(f_\theta(x)_j)}, \quad i = 1, \dots, k$
4:      Step 2: Compute the information entropy:
          $H(x) = -\sum_{i=1}^{k} g(x)_i \log(g(x)_i)$
5:      Step 3: Compute the augmentation magnitude:
          $\mathrm{mag}(x) = 1 - \frac{H(x)}{\log(k)}$
6:      As $\mathrm{mag}(x) \to 1$, the augmentation becomes more diverse; when $\mathrm{mag}(x) \to 0$, changes are minimal.
7:      Step 4: Randomly select an augmentation operation $a \in A$
8:      Step 5: Apply the augmentation operation with the computed magnitude:
          $x' = a(x, \mathrm{mag}(x))$
9:      Step 6: Compute the entropy regularization loss (EntLoss):
          $\mathrm{EntLoss}(x) = -\frac{1}{\log(k)} \sum_{i=1}^{k} g(x)_i \log(g(x)_i)$
10:     Combine with the cross-entropy loss (CEloss) to enhance performance:
          $\mathrm{Loss} = \mathrm{CEloss} + \lambda \cdot \mathrm{EntLoss}$,
        where $\lambda$ is a weighting hyperparameter.
11:   end for
12: end for
13: return Augmented dataset $D'$
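The sketch below shows how the entropy-to-magnitude mapping and the combined loss of Algorithm 1 could be wired into a PyTorch training step; the augmentation space, the way a magnitude scales each transform, and the weighting value $\lambda$ are assumptions, since the paper does not fix them here.

import math
import random
import torch
import torch.nn.functional as F
from torchvision.transforms import functional as TF

# Hypothetical augmentation space A: each entry applies one transform whose strength
# is scaled by the magnitude mag(x) in [0, 1].
AUG_SPACE = [
    lambda img, m: TF.rotate(img, angle=30.0 * m),
    lambda img, m: TF.adjust_brightness(img, 1.0 + 0.9 * m),
    lambda img, m: TF.adjust_contrast(img, 1.0 + 0.9 * m),
]

def entropy_magnitude(logits, k):
    # mag(x) = 1 - H(x)/log(k): confident (low-entropy) samples receive larger magnitudes.
    p = F.softmax(logits, dim=-1)
    h = -(p * p.clamp_min(1e-12).log()).sum(dim=-1)
    return (1.0 - h / math.log(k)).clamp(0.0, 1.0)

def ent_loss(logits, k):
    # EntLoss = H(x)/log(k), averaged over the batch (normalized entropy regularization).
    p = F.softmax(logits, dim=-1)
    return (-(p * p.clamp_min(1e-12).log()).sum(dim=-1) / math.log(k)).mean()

def eda_train_step(model, images, labels, optimizer, k=6, lam=0.1):
    with torch.no_grad():                                   # entropy feedback from the current model
        mags = entropy_magnitude(model(images), k)
    augmented = torch.stack([random.choice(AUG_SPACE)(img, float(m))
                             for img, m in zip(images, mags)])
    logits = model(augmented)
    loss = F.cross_entropy(logits, labels) + lam * ent_loss(logits, k)  # CEloss + λ·EntLoss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()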

4. Experimental Results and Discussion

4.1. Introducing the Datasets

In the past, disease identification in agriculture has predominantly focused on either plant diseases or pest infestations, with limited attention given to datasets encompassing both leaf and fruit diseases. Due to the considerable economic and health benefits of Annona squamosa, this study conducts a disease classification experiment on a comprehensive Annona squamosa dataset. The dataset comprises 8226 images of Annona squamosa fruits and leaves, categorized into six distinct types: Anthracnose, Black Canker, Diplodia Rot, Leaf Spot on fruit, Leaf Spot on leaf, and Mealy Bug, which include five fruit diseases and one leaf disease. As illustrated in Figure 3, images were collected across various growth stages, environmental conditions, and disease types, capturing different perspectives of single-type diseases from multiple angles. The dataset is primarily composed of diseased fruit images, with only 1255 images depicting diseased leaves, highlighting the challenge for deep learning models in distinguishing between these two disease categories.

4.2. Experimental Details

The experiments were conducted using the PyTorch 2.4.1 deep learning framework, with Python 3.12.4 in a Windows 10 operating system environment. The hardware setup included an Nvidia GeForce RTX 4090 GPU with 32 GB of memory and an Intel(R) Core(TM) i9-14900K CPU operating at 2.90 GHz.
During training, input samples were uniformly resized to 256 × 256 × 3, and the images were normalized using channel-wise mean and standard deviation vectors estimated from the training set. Each training cycle consisted of 200 epochs, with samples divided into mini-batches of size 16. The initial learning rate was 0.01 and was reduced to 10% of its value at epochs 55 and 75. The Adam optimizer was used to optimize the model, with momentum and weight decay values set at 0.9 and 0.0005, respectively. Early stopping was applied, halting training if accuracy remained unchanged for 10 consecutive epochs. The loss function was a combination of cross-entropy loss (CEloss) and entropy regularization loss (EntLoss).
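For reference, the configuration above translates roughly into the following PyTorch setup, reusing the hypothetical EDAViT skeleton sketched in Section 3.1; the normalization statistics shown are ImageNet placeholders (the paper estimates them from the training set), and interpreting the stated momentum of 0.9 as Adam's beta1 is an assumption.

from torch import optim
from torchvision import transforms

train_transform = transforms.Compose([
    transforms.Resize((256, 256)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],     # ImageNet placeholders; the paper
                         std=[0.229, 0.224, 0.225]),     # estimates these from the training set
])

model = EDAViT(num_classes=6)                            # backbone skeleton sketched in Section 3.1
optimizer = optim.Adam(model.parameters(), lr=0.01, betas=(0.9, 0.999), weight_decay=5e-4)
scheduler = optim.lr_scheduler.MultiStepLR(optimizer, milestones=[55, 75], gamma=0.1)  # x0.1 at epochs 55 and 75

def should_stop(val_accuracies, patience=10):
    # Early stopping: halt once the best validation accuracy has not improved for `patience` epochs.
    best_epoch = val_accuracies.index(max(val_accuracies))
    return len(val_accuracies) - 1 - best_epoch >= patience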
To assess the model’s effectiveness, standard key performance indicators (KPIs) such as the Matthews correlation coefficient (MCC), F1 score, precision, recall, and accuracy were employed. The MCC considers all four classification outcomes, including true positives, true negatives, false positives, and false negatives, reducing the impact of randomness in the evaluation. These metrics were used to evaluate the proposed model’s performance, particularly in accurately classifying images of Custard Apple diseases.
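A minimal sketch of these metrics using scikit-learn is given below; macro averaging over the six classes is an assumption, as the averaging mode is not stated.

import numpy as np
from sklearn.metrics import (accuracy_score, f1_score, matthews_corrcoef,
                             precision_score, recall_score)

def evaluate(y_true, y_pred):
    # Standard KPIs used above; macro averaging over the six disease classes is assumed.
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "precision": precision_score(y_true, y_pred, average="macro"),
        "recall": recall_score(y_true, y_pred, average="macro"),
        "f1": f1_score(y_true, y_pred, average="macro"),
        "mcc": matthews_corrcoef(y_true, y_pred),
    }

print(evaluate(np.array([0, 1, 2, 2, 1, 0]), np.array([0, 1, 2, 1, 1, 0])))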

4.3. Classification Results of Advanced Models

In this section, a comprehensive experimental comparison is performed using the Custard Apple dataset. The tested models include convolutional neural networks (CNNs) and vision transformer models. Table 1 presents the experimental results for various neural network models on the Custard Apple dataset. Both EDA–ViT and VMamba demonstrated outstanding performance across multiple metrics, particularly in accuracy, F1 score, and MCC. EDA–ViT achieved an accuracy of 0.9658, an F1 score of 0.9610, and an MCC of 0.9224, making it the top-performing model. This success is attributed to EDA–ViT’s unique multi-scale module, which captured global information and long-range dependencies, as well as a local information interaction module that enables the model to better capture both global and local features in complex image recognition tasks. VMamba followed closely with an F1 score of 0.9536 and an MCC of 0.9262, exhibiting stable performance. The architecture builds upon ViT, with likely optimizations in weight sharing and feature extraction that improve its overall performance.
Among transformer-based models, ViT-B, DeiT-Ti, LocalViT, and Swin generally outperform traditional CNNs (such as DenseNet and ResNetV2), though differences in trade-offs are evident. ViT-B achieved an F1 score of 0.9524, indicating strong image feature capture for visual tasks. However, its frames per second (FPS) of 13.54 revealed a limitation in computational efficiency, making it suitable for scenarios with lower real-time requirements, but less ideal for applications in agriculture, where efficiency is crucial.
In the CNN category, Inception and Xception stood out. Inception achieved an F1 score of 0.9403 and an accuracy of 0.9418, approaching the performance of ViT-based models. This was largely due to its multi-scale convolutional architecture, which enabled effective image capture across different scales. Lightweight models such as EfficientNet and MobileNetV2 strike a good balance between accuracy and FPS.
Emerging models like EfficientFormerV2 and MaxViT also exhibited strong performance across multiple metrics. EfficientFormerV2 achieved an F1 score of 0.9446 and an MCC of 0.9107, reflecting strong feature extraction capabilities and an effective balance between accuracy and computational efficiency. MaxViT, with an FPS of 13.67—slightly lower than that of EfficientFormerV2—demonstrated better generalization in handling complex visual tasks, with an accuracy of 0.9493 and an MCC of 0.9185.
Figure 4 presents the confusion matrices for the best-performing models (EDA–ViT and VMamba) and the worst-performing models (VGG-16 and VGG-19). These results were obtained by randomly selecting 300 samples from the test set for each run. The matrix indices 0 to 5 correspond to the categories Anthracnose, Black Canker, Diplodia Rot, Leaf Spot on fruit, Leaf Spot on leaf, and Mealy Bug, respectively. The proposed EDA–ViT model demonstrated superior performance across all categories, particularly excelling in recognizing Diplodia Rot (38) and Leaf Spot on leaf (52). According to the analysis of Figure 3, images in the Diplodia Rot category often include both leaves and infected fruit, creating cross-category features that are more challenging to distinguish. The DMSA mechanism within the interaction module of the proposed model enhanced the influence of local information, while context-aware data from the local path facilitated the differentiation between these overlapping features. VMamba, while utilizing a cross-scanning module (CSM) to achieve a global receptive field, struggled with distinguishing cross-category information when operating at this level. In contrast, the adaptive entropy data augmentation in EDA–ViT allowed for dynamic feature enhancement without constraining the hierarchical feature representations, leading to improved performance. The VGG models exhibited relatively poorer performance. Although VGG-19 incorporates additional layers compared to VGG-16, the improvement was marginal. VGG-19 showed a higher misclassification rate for Diplodia Rot (6), and similarly, VGG-16 had a high error rate for Anthracnose (6). The misclassification rate for each class was calculated as (FP + FN)/total instances of that class. The highest error rates for VGG-19 and VGG-16 were 2.0% and 2.67%, respectively, while EDA–ViT and VMamba achieved maximum error rates of 0.32% and 1.68%, respectively. Overall accuracy alone does not fully capture model performance. A deeper analysis of category-specific error rates through the confusion matrix offers a more detailed understanding of the strengths and weaknesses of each model.
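For clarity, the per-class misclassification rate defined above can be computed directly from a confusion matrix as in the following sketch; the 6 × 6 matrix values are illustrative, not the paper's results.

import numpy as np

def per_class_error_rate(cm):
    # Misclassification rate per class: (FP + FN) / total instances of that class,
    # with rows = actual classes and columns = predicted classes.
    fp = cm.sum(axis=0) - np.diag(cm)     # predicted as class c but actually another class
    fn = cm.sum(axis=1) - np.diag(cm)     # actually class c but predicted as another class
    totals = cm.sum(axis=1)               # total instances of each class
    return (fp + fn) / totals

cm = np.array([[48, 1, 0, 0, 0, 0],       # illustrative counts only
               [0, 50, 0, 0, 0, 0],
               [0, 0, 38, 1, 0, 0],
               [0, 0, 0, 55, 0, 0],
               [0, 0, 1, 0, 52, 0],
               [0, 0, 0, 0, 0, 54]])
print(per_class_error_rate(cm).round(4))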

4.4. Performance Comparison of Hybrid EntLoss

EDA–ViT improved performance through adaptive entropy-based data augmentation and entropy regularization loss. Figure 5 demonstrates that, compared to baseline training, the combination of EntLoss and EDA yielded a positive effect, with a significant reduction in the overall error rate.
In the ablation study, the global path of the interaction module containing DMSA was replaced with a local path for evaluation. In the adaptive data augmentation algorithm (EDA) test, Cutout, CutMix, and TrivialAugment were selected for comparison to assess the effectiveness of each component in the proposed EDA–ViT framework. As shown in Table 2, we systematically analyzed the impact of various modules, including DMSA, EDA, EntLoss, CEloss, and different data augmentation strategies (Cutout, CutMix, and TrivialAugment). The baseline model, utilizing only DMSA, CEloss, and TrivialAugment, achieved 94.78% accuracy with an MCC score of 0.9112, demonstrating the fundamental effectiveness of the attention mechanism. Replacing TrivialAugment with CutMix improved performance to 95.12% accuracy and an MCC of 0.9148, indicating that CutMix provided more effective regularization by preserving spatial context during augmentation. However, integrating Cutout slightly reduced performance to 94.25% accuracy, likely due to its aggressive information loss. Removing all augmentation strategies resulted in a performance drop to 92.78% accuracy, highlighting the importance of data augmentation in preventing overfitting. Incorporating the EDA module significantly enhanced accuracy to 95.78%, confirming our hypothesis that adaptive entropy-based data augmentation could improve feature discrimination. The EntLoss component complemented CEloss, raising accuracy to 93.33% when used together. Notably, the complete model configuration, including all components, achieved the best performance with 96.11% accuracy and an MCC of 0.9224. This demonstrates the synergistic effect of DMSA’s multi-scale feature learning and diverse representation subspaces, EDA’s controlled data augmentation, balanced loss functions, and appropriate augmentation strategies in enhancing the model’s representational capacity and generalization ability.

4.5. Adaptive Data Augmentation Analysis

In this section, data augmentation experiments were conducted on the proposed EDA–ViT to compare the performance of adaptive entropy-based data augmentation. As shown in Figure 6, EDA–ViT consistently outperformed other models across all experiments, achieving an accuracy of 0.961 ± 0.002, indicating its high stability and robustness in classification tasks. The MCC score of EDA–ViT was 0.922 ± 0.003, highlighting its exceptional ability to handle imbalanced data. In comparison, models incorporating AutoAugment and RandAugment strategies also exhibited strong performance. EDA–ViT with AutoAugment achieved an accuracy of 0.951 ± 0.020, which was close to the performance of the original EDA–ViT, demonstrating the effectiveness of AutoAugment in improving the model’s generalization capabilities. EDA–ViT using RandAugment followed closely with an accuracy of 0.948 ± 0.019, showing that RandAugment could also significantly enhance classification performance. The Cutout augmentation strategy showed slightly weaker results, with an F1-score of 0.928 ± 0.006 and an accuracy of 0.925 ± 0.017. This relatively lower performance could be attributed to the Cutout technique, which randomly masked certain regions of the image. In categories such as Diplodia Rot and Leaf Spot on fruit, which contain multiple key features, the advantage of masking becomes less apparent, since informative regions may be obscured along with non-essential ones. Overall, the EDA–ViT model demonstrated high performance both in its original form and when combined with various data augmentation strategies.

4.6. Extensive Testing for Plant Diseases

The method was tested on a diverse and novel set of plant disease datasets, including Paddy, Cabbage, Coffee, and FGVC8 [42]. The experimental results presented in Table 3 demonstrate the superior performance of the proposed EDA–ViT model across diverse plant disease datasets. In comparison with conventional CNN architectures like ResNet-50 and RepVGG-A0, EDA–ViT achieved notable improvements, particularly exhibiting accuracy gains of 1.69% and 2.32% respectively on the Paddy dataset. This enhancement can be attributed to EDA–ViT’s efficient dual-stream attention mechanism, which enables more comprehensive feature extraction from both local and global contexts of plant disease symptoms. The performance advantage was further evidenced in the Cabbage dataset, where EDA–ViT attains an accuracy of 0.9972, surpassing the baseline Swin-tiny model by 2.27%. Notably, on the challenging Coffee dataset, which presents complex disease manifestations, EDA–ViT demonstrated substantial improvement with an accuracy of 0.8359, outperforming the previous state-of-the-art Swin-tiny+EFG by 2.92%. This significant enhancement can be ascribed to the model’s enhanced feature guidance module, which facilitates more effective capture of subtle disease-specific patterns. The model’s generalization capability was further validated on the FGVC8 dataset, where it achieved an accuracy of 0.9611, surpassing that of the specialized DFN-PSAN architecture by 2.87%. These comprehensive results underscore the effectiveness of the proposed attention-based architecture in capturing both fine-grained disease characteristics and global contextual information, thereby enabling more robust and accurate plant disease classification across diverse agricultural scenarios.

4.7. Vision Attention Visualization

Class Activation Mapping (CAM) is used to explain the network’s decision-making process. In this study, the weights from the classification head of the EDA–ViT model are selected, with the target layer being the convolutional layer after the resolution is reduced to 32. CAM employs a color gradient from blue (least important) to red (most important) to highlight the regions of interest for disease diagnosis.
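A minimal weight-based CAM sketch consistent with this description is shown below; the attribute names (model.stages, model.head) follow the hypothetical backbone skeleton sketched in Section 3.1, and the choice of target layer is an assumption.

import torch
import torch.nn.functional as F

def compute_cam(model, image, target_layer):
    # Hook the target layer, run a forward pass, and weight its feature maps by the
    # classification-head weights of the predicted class; the result is normalized to
    # [0, 1] and upsampled to the input resolution (blue = low, red = high importance).
    feats = {}
    handle = target_layer.register_forward_hook(
        lambda module, inputs, output: feats.update(maps=output.detach()))
    logits = model(image.unsqueeze(0))
    handle.remove()
    predicted_class = logits.argmax(dim=1).item()
    fmap = feats["maps"][0]                                    # (C, h, w) feature maps
    weights = model.head[-1].weight[predicted_class]           # (C,) classifier weights
    cam = torch.relu((weights[:, None, None] * fmap).sum(dim=0))
    cam = (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)
    return F.interpolate(cam[None, None], size=image.shape[-2:],
                         mode="bilinear", align_corners=False)[0, 0]

# e.g., heatmap = compute_cam(model, img_tensor, target_layer=model.stages[-1])
# (the target layer's channel count must match the head's input dimension)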
As shown in Figure 7, the images depict various plant diseases affecting different parts, including fruit surfaces, leaf structures, and bark textures. When target images contain unrelated features such as leaves or soil, CAM consistently highlighted the diseased or infected areas on the plant’s leaves or fruits, demonstrating EDA–ViT’s ability to focus on relevant features for diagnosis. Additionally, the receptive field of these focus areas is relatively large, incorporating broader regions that might initially seem unrelated but actually assist the model in linking contextual information for local feature recognition. This aligns with the findings of Michał et al. [43], where SAM–ViT was used to explain coronary artery stenosis detection. Wang et al. [44] simulate patch dependencies using a Squeezing and Stimulating Patches (P-SE) technique while removing context-aware patches marked by the CAM module from the main branch, thus refining the model’s focus points, especially in simpler datasets.
This visualization technique provides insights into the decision-making process of plant disease classification models, highlighting the capability of EDA–ViT to identify and localize specific disease features across different plant structures and pathologies. The consistency between visually evident symptoms and the CAM-highlighted regions suggests that the model had learned to focus on clinically relevant areas to accurately diagnose diseases in custard apple crops.

5. Discussion

5.1. Synergistic Relationship Between DMSA and EDA

The DMSA module enhanced the model’s capability to process diverse visual information by capturing image features across multiple scales. Unlike traditional Vision Transformers (ViTs) that rely on fixed attention mechanisms, DMSA dynamically adjusts its attention weights based on the characteristics of the input image, enabling it to capture critical information more effectively. Meanwhile, adaptive augmentation analyzes the complexity of input data in real time, dynamically adjusting the augmentation strategies to provide more targeted sample diversity during training. The integration of DMSA with adaptive augmentation in the EDA–ViT framework offers enhanced flexibility and robustness for handling visual tasks. Specifically, DMSA improves the capture of both local details and global context through multi-scale feature fusion, while adaptive augmentation increases sample diversity, strengthening the model’s resistance to overfitting. This synergy not only enhances overall performance but also equips EDA–ViT with superior generalization ability, enabling it to excel across a range of visual tasks. Moreover, experimental results demonstrate that EDA–ViT significantly outperforms traditional ViT architectures across multiple benchmark datasets, validating the critical role of combining DMSA with adaptive augmentation in boosting the effectiveness of vision models. In summary, the collaborative mechanism between DMSA and adaptive augmentation not only provides Vision Transformers with more powerful feature extraction capabilities but also lays a new theoretical foundation for tackling complex visual tasks. It highlights promising directions for future research in optimizing model design and practical applications.

5.2. Challenges and Advances in Large-Scale Agricultural Monitoring

In large-scale agricultural monitoring, several key challenges influence the effectiveness and applicability of advanced models. Environmental variability, such as shifts in lighting, weather, and terrain, introduces inconsistencies in image features, which can negatively affect the model’s accuracy. Seasonal differences and crop growth stages further complicate performance, requiring the incorporation of such variables during training to enhance the model’s robustness and adaptability. Image resolution is another critical factor. High-resolution images, while providing greater detail, demand more computational resources, potentially limiting real-time processing. Striking a balance between resolution and computational efficiency is crucial, especially in scenarios where resources are constrained. Recent studies emphasize the importance of optimizing imagery sensors, such as multispectral and hyperspectral cameras, and employing UAVs and satellites for data collection to improve spatial and temporal resolution without overwhelming computational capabilities [45].
Real-time performance is essential for effective agricultural monitoring, particularly in dynamic field conditions where rapid decision-making is necessary. However, complex models with extensive inference times can hinder their practical utility. Current research suggests integrating deep learning approaches, including convolutional neural networks (CNNs) and transfer learning, to achieve faster model responses. Furthermore, innovations like data assimilation—combining remote sensing data with crop growth models—enable more accurate predictions and improve adaptability in varying environmental conditions. Precision agriculture benefits from these advancements by employing technologies such as the ensemble Kalman filter and Bayesian assimilation strategies, which enhance monitoring accuracy and yield prediction. However, challenges persist, including handling large datasets, improving model generalization, and minimizing the computational burden for real-time deployment in agricultural systems [46]. Future research aims to refine these models further, focusing on integrating multiple data sources and advancing the practical application of real-time monitoring systems in agriculture. These insights align with recent work in the field, emphasizing the importance of balancing data quality with computational efficiency and exploring advanced data fusion techniques to enhance agricultural monitoring and forecasting capabilities.

5.3. Explainability and Human–Machine Synergies in Agricultural Automation

The application of automated plant disease diagnostic technology is increasingly widespread in agriculture, particularly with the support of deep learning and computer vision. While these technologies hold potential for enhancing diagnostic efficiency and accuracy, they also raise ethical challenges. The risks of false positives and false negatives are especially pronounced in critical agricultural contexts, where diagnostic errors may lead to misinformed crop management strategies, thereby impacting crop yield and quality significantly [47,48].
False positives, where healthy plants are misidentified as diseased, can lead farmers to undertake unnecessary interventions, increasing chemical pesticide use and causing environmental pollution and resource wastage. Conversely, false negatives, where diseased plants go undetected, may allow diseases to spread, ultimately harming overall crop health. To mitigate these risks, human-machine collaboration emerges as a key approach. Integrating agricultural experts’ knowledge with automated systems can enhance diagnostic reliability [49,50]. Human experts serve as a secondary safeguard, reviewing uncertain cases to ensure decision accuracy. Additionally, model interpretability plays a crucial role in addressing ethical concerns associated with automated diagnostics. Traditional deep learning models, particularly convolutional neural networks (CNNs) and vision transformers (ViTs), while effective in plant disease classification, lack interpretability in their decision-making processes. This limitation makes it difficult for farmers and agricultural experts to fully trust model judgments. By implementing explainable artificial intelligence (XAI) techniques, the decision-making process of models becomes more transparent, enabling users to understand how conclusions are reached [50]. For example, Class Activation Mapping (CAM) can help identify the image regions a model focuses on, aiding users in comprehending the basis for the model’s diagnosis.
The integration of such interpretability techniques fosters farmers’ trust in diagnostic systems, enhancing their effectiveness in real-world agricultural settings. In summary, while automated plant disease diagnostics offer substantial advantages for crop health management efficiency, the ethical concerns associated with their application remain significant. Strengthening human-machine collaboration and improving model interpretability can help alleviate the ethical risks of false positives and false negatives. This approach not only enhances system reliability but also promotes acceptance and trust in automated diagnostic technologies among users.

6. Conclusions

Custard apple, a high-quality tropical fruit, has significant market potential due to its unique taste and rich nutritional value. However, it is also highly susceptible to various diseases and pests during its growth. Early detection and diagnosis of diseases are therefore crucial for effective pest and disease control in custard apple cultivation. This paper proposes a lightweight custard apple classification model, EDA–ViT, which integrates a multi-branch aggregation module, a multi-scale feature-weighted fusion module, and an interaction module for global scene context and local visual information. The multi-scale module enhances the focus on multiple diseased areas during feature extraction, while the interaction module captures both global context and local features of leaf and fruit lesions. The DMSA mechanism, embedded within the interaction module, refines the multi-head attention mechanism by adjusting spatial positions to improve feature extraction. This is achieved by fusing local information and global context through the integration of local and global paths. Experimental results demonstrate that the EDA–ViT model surpasses other state-of-the-art models in custard apple disease classification, achieving an accuracy of 96.58%, an F1 score of 96.10%, and an MCC of 92.24%. During EDA–ViT training, a non-intrusive adaptive entropy-based data augmentation (EDA) method is introduced, leading to an accuracy improvement of 0.56% and an MCC increase of 0.34%. When combined with the EntLoss loss function, accuracy improves by 0.91%, and MCC by 0.50%. The classification and visualization of custard apple images intuitively validate that the EDA–ViT model enhances the focus on key features of leaf and fruit lesions, resulting in more precise classifications. It effectively captures the underlying texture features of diseased images, leading to the extraction of critical characteristics. Ablation studies confirm that the dual-attention mechanism improves the accuracy of lesion feature extraction and localization, allowing the model to not only focus on the lesion center but also recognize lesion edges and early symptoms, which are essential for timely diagnosis and prevention. The model exhibited strong generalization capabilities, successfully identifying common diseases in custard apple from fruits to leaves. This approach can serve as an effective reference for disease identification in crop monitoring through stationary systems or UAV-based field monitoring of crops.

Author Contributions

Conceptualization, K.C. and G.D.; methodology, K.C. and G.D.; validation, G.D. and J.H.; formal analysis, K.C.; investigation, K.C.; resources, K.C.; data curation, K.C. and G.D.; writing—original draft preparation, K.C. and G.D.; writing—review and editing, K.C., J.H., G.D., C.D. and J.F.; visualization, G.D.; supervision, K.C. and G.D.; funding acquisition, K.C. and G.D. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported in part by the Zhejiang Agriculture and Forestry University, Jiyang College, under Grant 2024ZJC0028, and in part by the Sichuan Natural Science Foundation under Grant 2023NSFSC1403.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data presented in this study are available in Mendeley at https://data.mendeley.com/datasets/jtgh2885yf/2 (accessed on 20 September 2024).

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Moussa, A.Y.; Siddiqui, S.A.; Elhawary, E.A.; Guo, K.; Anwar, S.; Xu, B. Phytochemical constituents, bioactivities, and applications of custard apple (Annona squamosa L.): A narrative review. Food Chem. 2024, 459, 140363. [Google Scholar] [CrossRef]
  2. Gupta, S.; Tripathi, A.K. Fruit and vegetable disease detection and classification: Recent trends, challenges, and future opportunities. Eng. Appl. Artif. Intell. 2024, 133, 108260. [Google Scholar] [CrossRef]
  3. Javidan, S.M.; Banakar, A.; Rahnama, K.; Vakilian, K.A.; Ampatzidis, Y. Feature engineering to identify plant diseases using image processing and artificial intelligence: A comprehensive review. Smart Agric. Technol. 2024, 8, 100480. [Google Scholar] [CrossRef]
  4. Nargesi, M.H.; Kheiralipour, K. Ability of visible imaging and machine learning in detection of chickpea flour adulterant in original cinnamon and pepper powders. Heliyon 2024, 10, e35944.
  5. Liu, J.; Wang, X. Multisource information fusion method for vegetable disease detection. BMC Plant Biol. 2024, 24, 738.
  6. Malik, M.M.; Fayyaz, A.M.; Yasmin, M.; Abdulkadir, S.J.; Al-Selwi, S.M.; Raza, M.; Waheed, S. A novel deep CNN model with entropy coded sine cosine for corn disease classification. J. King Saud Univ.-Comput. Inf. Sci. 2024, 36, 102126.
  7. Ojo, M.O.; Zahid, A. Improving Deep Learning Classifiers Performance via Preprocessing and Class Imbalance Approaches in a Plant Disease Detection Pipeline. Agronomy 2023, 13, 887.
  8. Ruan, G.; Schmidhalter, U.; Yuan, F.; Cammarano, D.; Liu, X.; Tian, Y.; Zhu, Y.; Cao, W.; Cao, Q. Exploring the transferability of wheat nitrogen status estimation with multisource data and Evolutionary Algorithm-Deep Learning (EA-DL) framework. Eur. J. Agron. 2023, 143, 126727.
  9. Shafik, W.; Tufail, A.; Liyanage, C.D.S.; Apong, R.A.A.H.M. Using transfer learning-based plant disease classification and detection for sustainable agriculture. BMC Plant Biol. 2024, 24, 136.
  10. Sheikh, M.; Iqra, F.; Ambreen, H.; Pravin, K.A.; Ikra, M.; Chung, Y.S. Integrating artificial intelligence and high-throughput phenotyping for crop improvement. J. Integr. Agric. 2024, 23, 1787–1802.
  11. Sun, L.; Wang, X.; Zheng, Y.; Wu, Z.; Fu, L. Multiscale 3-D–2-D Mixed CNN and Lightweight Attention-Free Transformer for Hyperspectral and LiDAR Classification. IEEE Trans. Geosci. Remote Sens. 2024, 62, 1–16.
  12. Sharma, V.; Tripathi, A.K.; Mittal, H. DLMC-Net: Deeper lightweight multi-class classification model for plant leaf disease detection. Ecol. Inform. 2023, 75, 102025.
  13. Zhang, F.; Jin, X.; Lin, G.; Jiang, J.; Wang, M.; An, S.; Hu, J.; Lyu, Q. Hybrid attention network for citrus disease identification. Comput. Electron. Agric. 2024, 220, 108907.
  14. Hemalatha, S.; Jayachandran, J.J.B. A Multitask Learning-Based Vision Transformer for Plant Disease Localization and Classification. Int. J. Comput. Intell. Syst. 2024, 17, 188.
  15. Rezaei, M.; Diepeveen, D.; Laga, H.; Jones, M.G.; Sohel, F. Plant disease recognition in a low data scenario using few-shot learning. Comput. Electron. Agric. 2024, 219, 108812.
  16. Zeng, Q.; Niu, L.; Wang, S.; Ni, W. SEViT: A large-scale and fine-grained plant disease classification model based on transformer and attention convolution. Multimed. Syst. 2023, 29, 1001–1010.
  17. Sharma, V.; Tripathi, A.K.; Mittal, H.; Nkenyereye, L. SoyaTrans: A novel transformer model for fine-grained visual classification of soybean leaf disease diagnosis. Expert Syst. Appl. 2025, 260, 125385.
  18. Devi, R.S.S.; Kumar, V.R.V.; Sivakumar, P. InViTMixup: Plant disease classification using convolutional vision transformer with Mixup augmentation. J. Chin. Inst. Eng. 2024, 47, 520–527.
  19. Ali, A.H.; Youssef, A.; Abdelal, M.; Raja, M.A. An ensemble of deep learning architectures for accurate plant disease classification. Ecol. Inform. 2024, 81, 102618.
  20. Uğuz, S.; Şikaroğlu, G.; Yağız, A. Disease detection and physical disorders classification for citrus fruit images using convolutional neural network. Food Meas. 2023, 17, 2353–2362.
  21. Syed-Ab-Rahman, S.F.; Hesamian, M.H.; Prasad, M. Citrus disease detection and classification using end-to-end anchor-based deep learning model. Appl. Intell. 2022, 52, 927–938.
  22. Zhang, X.; Xun, Y.; Chen, Y. Automated identification of citrus diseases in orchards using deep learning. Biosyst. Eng. 2022, 223, 249–258.
  23. Hassam, M.; Khan, M.A.; Armghan, A.; Althubiti, S.A.; Alhaisoni, M.; Alqahtani, A.; Kadry, S.; Kim, Y. A Single Stream Modified MobileNet V2 and Whale Controlled Entropy Based Optimization Framework for Citrus Fruit Diseases Recognition. IEEE Access 2022, 10, 91828–91839.
  24. Dhiman, P.; Kukreja, V.; Manoharan, P.; Kaur, A.; Kamruzzaman, M.M.; Ben Dhaou, I.; Iwendi, C. A Novel Deep Learning Model for Detection of Severity Level of the Disease in Citrus Fruits. Electronics 2022, 11, 495.
  25. Chen, Y.; Wu, Q. Grape leaf disease identification with sparse data via generative adversarial networks and convolutional neural networks. Precis. Agric. 2023, 24, 235–253.
  26. Giakoumoglou, N.; Pechlivani, E.M.; Tzovaras, D. Generate-Paste-Blend-Detect: Synthetic dataset for object detection in the agriculture domain. Smart Agric. Technol. 2023, 5, 100258.
  27. Zhang, Z.; Zhan, W.; Sun, Y.; Peng, J.; Zhang, Y.; Guo, Y.; Sun, K.; Gui, L. Mask-guided dual-perception generative adversarial network for synthesizing complex maize diseased leaves to augment datasets. Eng. Appl. Artif. Intell. 2024, 136, 108875.
  28. Xu, H.; Li, Q.; Chen, J. Highlight Removal from A Single Grayscale Image Using Attentive GAN. Appl. Artif. Intell. 2022, 36, 1988441.
  29. Dai, G.; Tian, Z.; Fan, J.; Sunil, C.; Dewi, C. DFN-PSAN: Multi-level deep information feature fusion extraction network for interpretable plant disease classification. Comput. Electron. Agric. 2024, 216, 108481.
  30. Dai, G.; Fan, J.; Tian, Z.; Wang, C. PPLC-Net: Neural network-based plant disease identification model supported by weather data augmentation and multi-level attention mechanism. J. King Saud Univ.-Comput. Inf. Sci. 2023, 35, 101555.
  31. Barman, U.; Sarma, P.; Rahman, M.; Deka, V.; Lahkar, S.; Sharma, V.; Saikia, M.J. ViT-SmartAgri: Vision Transformer and Smartphone-Based Plant Disease Detection for Smart Agriculture. Agronomy 2024, 14, 327.
  32. Fan, Y.; Chen, C. OmiQnet: Multiscale feature aggregation convolutional neural network for omnidirectional image assessment. Appl. Intell. 2024, 54, 5711–5727.
  33. Zhang, J.; Li, X.; Wang, Y.; Wang, C.; Yang, Y.; Liu, Y.; Tao, D. EATFormer: Improving Vision Transformer Inspired by Evolutionary Algorithm. Int. J. Comput. Vis. 2024, 132, 3509–3536.
  34. Lai, B.; Liu, M.; Ryan, F.; Rehg, J.M. In the Eye of Transformer: Global–Local Correlation for Egocentric Gaze Estimation and Beyond. Int. J. Comput. Vis. 2024, 132, 854–871.
  35. Zhang, X.; Pu, L.; Wan, L.; Wang, X.; Zhou, Y. DS-MSFF-Net: Dual-path self-attention multi-scale feature fusion network for CT image segmentation. Appl. Intell. 2024, 54, 4490–4506.
  36. Cui, L.; Tian, X.; Wei, Q.; Liu, Y. A self-attention based contrastive learning method for bearing fault diagnosis. Expert Syst. Appl. 2024, 238, 121645.
  37. Yang, Z.; Sinnott, R.O.; Bailey, J.; Ke, Q. A survey of automated data augmentation algorithms for deep learning-based image classification tasks. Knowl. Inf. Syst. 2023, 65, 2805–2861.
  38. Gao, X.; Xiao, Z.; Deng, Z. High accuracy food image classification via vision transformer with data augmentation and feature augmentation. J. Food Eng. 2024, 365, 111833.
  39. Zaji, A.; Liu, Z.; Xiao, G.; Bhowmik, P.; Sangha, J.S.; Ruan, Y. AutoOLA: Automatic object level augmentation for wheat spikes counting. Comput. Electron. Agric. 2023, 205, 107623.
  40. Yang, S.; Shen, F.; Zhao, J. EntAugment: Entropy-Driven Adaptive Data Augmentation Framework for Image Classification. arXiv 2024, arXiv:2409.06290.
  41. Dehghan, A.; Razzaghi, P.; Abbasi, K.; Gharaghani, S. TripletMultiDTI: Multimodal representation learning in drug-target interaction prediction with triplet loss function. Expert Syst. Appl. 2023, 232, 120754.
  42. Chang, B.; Wang, Y.; Zhao, X.; Li, G.; Yuan, P. A general-purpose edge-feature guidance module to enhance vision transformers for plant disease identification. Expert Syst. Appl. 2024, 237, 121638.
  43. Jungiewicz, M.; Jastrzębski, P.; Wawryka, P.; Przystalski, K.; Sabatowski, K.; Bartuś, S. Vision Transformer in stenosis detection of coronary arteries. Expert Syst. Appl. 2023, 228, 120234.
  44. Wang, X.; Yang, J.; Hu, M.; Ren, F. EERCA-ViT: Enhanced Effective Region and Context-Aware Vision Transformers for image sentiment analysis. J. Vis. Commun. Image Represent. 2023, 97, 103968.
  45. Wang, D.; Cao, W.; Zhang, F.; Li, Z.; Xu, S.; Wu, X. A Review of Deep Learning in Multiscale Agricultural Sensing. Remote Sens. 2022, 14, 559.
  46. Wang, J.; Wang, Y.; Qi, Z. Remote Sensing Data Assimilation in Crop Growth Modeling from an Agricultural Perspective: New Insights on Challenges and Prospects. Agronomy 2024, 14, 1920.
  47. González-Rodríguez, V.E.; Izquierdo-Bueno, I.; Cantoral, J.M.; Carbú, M.; Garrido, C. Artificial Intelligence: A Promising Tool for Application in Phytopathology. Horticulturae 2024, 10, 197.
  48. Balaska, V.; Adamidou, Z.; Vryzas, Z.; Gasteratos, A. Sustainable Crop Protection via Robotics and Artificial Intelligence Solutions. Machines 2023, 11, 774.
  49. Holzinger, A.; Fister, I.; Fister, I.; Kaul, H.-P.; Asseng, S. Human-Centered AI in Smart Farming: Toward Agriculture 5.0. IEEE Access 2024, 12, 62199–62214.
  50. Rong, Y.; Leemann, T.; Nguyen, T.-T.; Fiedler, L.; Qian, P.; Unhelkar, V.; Seidel, T.; Kasneci, G.; Kasneci, E. Towards Human-Centered Explainable AI: A Survey of User Studies for Model Explanations. IEEE Trans. Pattern Anal. Mach. Intell. 2024, 46, 2104–2122.
Figure 1. Network architecture of the proposed EDA–ViT.
Figure 2. Proposed interaction module and improved multi-head self-attention (MSA) mechanisms.
Figure 3. Sample images from the leaf disease and fruit disease dataset of custard apple (Annona squamosa).
Figure 4. Confusion matrices for the models with the highest and lowest classification accuracy on the custard apple test set. Larger diagonal values indicate higher classification accuracy, i.e., better model predictions. The matrices were computed from 300 randomly selected test-set samples.
Figure 5. Test error rates on the custard apple dataset using EDA–ViT.
Figure 6. Performance comparison of different data augmentation methods for the proposed EDA–ViT.
Figure 7. CAM visualizations produced by EDA–ViT when diagnosing custard apple crops.
Table 1. Experimental results of advanced neural network models on the custard apple dataset.

| Method | Precision | Recall | F1-Score | Accuracy | MCC | FPS |
|---|---|---|---|---|---|---|
| DenseNet | 0.9347 | 0.9319 | 0.9321 | 0.9326 | 0.8933 | 16.12 |
| ResNetV2 | 0.9172 | 0.9159 | 0.9163 | 0.9160 | 0.8672 | 22.53 |
| Xception | 0.9383 | 0.9379 | 0.9378 | 0.9380 | 0.9113 | 21.39 |
| EfficientNet | 0.9317 | 0.9290 | 0.9280 | 0.9290 | 0.8891 | 26.21 |
| Inception | 0.9418 | 0.9406 | 0.9403 | 0.9290 | 0.9056 | 20.53 |
| MobileNetV2 | 0.9211 | 0.9202 | 0.9201 | 0.9202 | 0.8670 | 21.91 |
| VGG-19 | 0.9128 | 0.9115 | 0.9117 | 0.9116 | 0.8598 | 18.76 |
| VGG-16 | 0.9143 | 0.9131 | 0.9133 | 0.9132 | 0.8621 | 17.89 |
| DeiT-Ti | 0.9269 | 0.9264 | 0.9265 | 0.9264 | 0.8824 | 25.53 |
| LocalViT | 0.9335 | 0.9303 | 0.9294 | 0.9303 | 0.8902 | 16.81 |
| Swin | 0.9294 | 0.9251 | 0.9251 | 0.9251 | 0.8816 | 20.79 |
| ViT-B | 0.9530 | 0.9523 | 0.9524 | 0.9523 | 0.9239 | 13.54 |
| VMamba | 0.9536 | 0.9536 | 0.9535 | 0.9536 | 0.9262 | 15.25 |
| MaxViT | 0.9498 | 0.9492 | 0.9493 | 0.9492 | 0.9185 | 13.67 |
| EfficientFormerv2 | 0.9452 | 0.9445 | 0.9446 | 0.9445 | 0.9107 | 15.32 |
| EDA–ViT | 0.9658 | 0.9566 | 0.9610 | 0.9611 | 0.9224 | 12.22 |
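The metrics reported in Table 1 (precision, recall, F1-score, accuracy, and MCC) follow their standard definitions. The snippet below is a minimal sketch of how they can be computed from test-set predictions with scikit-learn; it is illustrative only, not the authors' evaluation code, and the macro averaging scheme and the dummy labels are assumptions.

```python
# Illustrative sketch (not the authors' code): computing the Table 1 metrics
# from ground-truth and predicted class labels with scikit-learn.
from sklearn.metrics import (accuracy_score, f1_score, matthews_corrcoef,
                             precision_score, recall_score)

def summarize_predictions(y_true, y_pred):
    """Return the per-model metrics used in Table 1 (macro averaging assumed)."""
    return {
        "precision": precision_score(y_true, y_pred, average="macro"),
        "recall": recall_score(y_true, y_pred, average="macro"),
        "f1": f1_score(y_true, y_pred, average="macro"),
        "accuracy": accuracy_score(y_true, y_pred),
        "mcc": matthews_corrcoef(y_true, y_pred),
    }

# Dummy 4-class example purely for demonstration.
y_true = [0, 1, 2, 3, 1, 2, 0, 3]
y_pred = [0, 1, 2, 3, 1, 1, 0, 3]
print(summarize_predictions(y_true, y_pred))
```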
Table 2. Ablation experiments for the proposed EDA–ViT.

| DMSA | EDA | EntLoss | CEloss | Cutout | CutMix | TrivialAugment | Accuracy | MCC |
|---|---|---|---|---|---|---|---|---|
|  |  |  |  |  |  |  | 0.9478 | 0.9112 |
|  |  |  |  |  |  |  | 0.9512 | 0.9148 |
|  |  |  |  |  |  |  | 0.9425 | 0.9087 |
|  |  |  |  |  |  |  | 0.9278 | 0.8941 |
|  |  |  |  |  |  |  | 0.9578 | 0.9209 |
|  |  |  |  |  |  |  | 0.9333 | 0.9015 |
|  |  |  |  |  |  |  | 0.9425 | 0.9079 |
|  |  |  |  |  |  |  | 0.9611 | 0.9224 |
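Cutout, CutMix, and TrivialAugment in Table 2 are off-the-shelf augmentation baselines against which the adaptive entropy-based augmentation (EDA) is compared. The sketch below shows how such baselines are commonly instantiated with torchvision; it is a minimal illustration under assumed hyperparameters (image size, erasing scale, class count), not the authors' training pipeline.

```python
# Hedged sketch of the Table 2 augmentation baselines using torchvision;
# all hyperparameter values here are assumptions for illustration.
import torch
from torchvision import transforms
from torchvision.transforms import v2

# Cutout is commonly realized as RandomErasing applied to the image tensor.
cutout_pipeline = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.RandomErasing(p=0.5, scale=(0.02, 0.2)),
])

# TrivialAugment applies one randomly chosen operation at a random magnitude.
trivialaugment_pipeline = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.TrivialAugmentWide(),
    transforms.ToTensor(),
])

# CutMix operates on whole batches of (images, integer class labels).
cutmix = v2.CutMix(num_classes=4)  # 4 is a hypothetical number of disease classes
images = torch.rand(8, 3, 224, 224)
labels = torch.randint(0, 4, (8,))
mixed_images, mixed_labels = cutmix(images, labels)
```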
Table 3. Test results of the proposed EDA–ViT model on broader, novel plant disease datasets. Values are classification accuracy (↑).

| Method | Paddy | Cabbage | Coffee | FGVC8 |
|---|---|---|---|---|
| ResNet-50 | 0.9642 | 0.9965 | 0.7334 | 0.9464 |
| RepVGG-A0 | 0.9579 | 0.9912 | 0.7475 | 0.9365 |
| Swin-tiny | 0.9554 | 0.9745 | 0.7596 | 0.9066 |
| PVT-tiny | 0.9537 | 0.9578 | 0.7365 | 0.9161 |
| Swin-tiny+EFG [42] | 0.9744 | 0.9956 | 0.8067 | - |
| DFN-PSAN [29] | - | - | - | 0.9324 |
| EDA–ViT | 0.9811 | 0.9972 | 0.8359 | 0.9611 |