1. Introduction
Rice, one of the most important staple crops worldwide, is a primary food source for much of the Asian population. However, during its growth cycle, rice is often affected by diseases such as rice blast, rice sheath blight, and rice bacterial leaf blight, which significantly reduce both yield and quality. Traditional plant disease diagnosis methods mainly rely on manual observation and machine learning models that require manual feature extraction [1,2,3]. Manual observation is not only time-consuming and labor-intensive but also limited by the observer’s technical expertise and experience, which can lead to misjudgments and omissions. Additionally, manual feature extraction involves complex preprocessing and segmentation steps tailored to different disease types and severity levels, making it difficult to capture the underlying patterns in images [4]. Therefore, rapid and accurate identification of rice diseases is crucial for implementing effective prevention and control measures, ensuring food security, and promoting sustainable agricultural development.
With the rapid advancement of computer vision technology, deep learning-based rice disease identification methods have emerged as promising solutions [5]. Convolutional neural networks (CNNs), deep multi-layer neural networks, offer strong expressiveness and generalization capabilities, enabling the automatic extraction of high-level semantic features from large datasets [6]. By collecting image data from rice leaves or whole plants and using deep learning models to extract features, accurate rice disease identification can be achieved. However, the large parameter sizes and computational costs of these models present challenges for their deployment in resource-constrained agricultural environments. To address these limitations, lightweight CNN models, such as MobileNet and EfficientNet, have been proposed. These models are attractive because of their proven ability to balance high accuracy with reduced computational requirements, making them well suited to mobile and embedded applications in agriculture. For instance, Elakya et al. [7] achieved 98.73% accuracy in classifying rice diseases using an enhanced MobileNetV2 model while maintaining a lightweight architecture. Asvitha et al. [8] developed a smartphone-based rice disease and pest identification system using the MobileNetV3 framework, achieving an accuracy of 93.75% and helping to reduce crop losses. Wang et al. [9] proposed a lightweight EfficientNet ensemble method, achieving an accuracy of 96.10% in classifying five types of rice tissue images while improving computational efficiency compared to traditional CNN models.
Despite these advancements, lightweight models still encounter challenges in complex field environments and with overlapping disease symptoms. Attention mechanisms have significantly enhanced the performance of deep learning models in plant disease image recognition by improving feature representation and focusing on disease-relevant regions. For instance, spatial attention mechanisms prioritize key areas within an image, improving lesion region localization and enhancing classification accuracy [10]. Channel attention mechanisms, such as Squeeze-and-Excitation (SE) blocks, refine inter-channel features by assigning higher weights to informative channels, thus improving performance in complex classification tasks [11]. Coordinate attention (CA), which combines spatial and channel information along coordinate axes, has been effectively utilized in lightweight CNN architectures to improve accuracy while maintaining computational efficiency [12]. Additionally, Transformer-based models, such as the Vision Transformer (ViT) and Swin Transformer, have emerged as powerful alternatives, leveraging self-attention mechanisms to model long-range dependencies and enhance robustness in complex scenarios. Rachman et al. [13] demonstrated the effectiveness of ViT in classifying rice diseases under diverse field conditions, while Zhang et al. [14] utilized the Swin Transformer for high-precision disease detection, outperforming CNN models. To further improve accuracy and efficiency in plant disease recognition, recent studies have integrated CNNs with Transformer architectures. Ding et al. [15] developed a model integrating deep transfer learning, achieving high accuracy in apple leaf disease classification with efficient computational requirements, while Thai et al. [16] combined the CNN’s local feature extraction with the Transformer’s global representation learning, demonstrating strong performance on multiple crop datasets while ensuring real-time processing. These approaches highlight the potential of hybrid models for efficient and accurate disease detection.
Although deep learning models achieve high accuracy in plant disease classification, they are often criticized for their “black-box” nature, which limits interpretability, raises trust concerns, and complicates error diagnosis in critical applications such as agriculture. This lack of transparency prevents users from understanding how models make decisions, particularly in cases of incorrect predictions or overlapping disease classes. To address these challenges, various visualization techniques have been developed to offer insights into model behavior. Karim et al. [17] developed a grape leaf disease classification model based on MobileNetV3-Large and used the Gradient-weighted Class Activation Mapping (Grad-CAM) technique to identify and highlight the image regions involved in the model’s decision-making process. Maeda-Gutiérrez et al. [18] developed multiple deep learning models for taro disease image classification and used SHapley Additive exPlanations (SHAP) to explain the models, revealing distinct classification strategies and identifying key features, such as leaf veins and color patterns, that the models rely on for disease prediction. Hu et al. [19] introduced a decoupled feature learning framework, leveraging causal inference techniques to build a pest image classification model, and used t-Distributed Stochastic Neighbor Embedding (t-SNE) to visualize the model’s feature extraction performance, demonstrating improved inter-class separation and reduced intra-class variance. These techniques bridge the gap between model decisions and human understanding, enhancing trust and enabling more reliable deployment in real-world applications.
However, despite the proven effectiveness of attention mechanisms in enhancing deep learning models, certain limitations constrain their performance in complex environments. While mechanisms such as CA and Efficient Channel Attention (ECA) effectively improve feature selection, a single attention mechanism may not be sufficient for tasks with intricate backgrounds, such as rice disease classification in field conditions. The diversity of leaf shapes, disease spots, and complex field backgrounds makes it difficult for these mechanisms to capture all relevant spatial dependencies [20]. Additionally, while Grad-CAM and SHAP provide valuable insights into model decisions, they also have limitations. Grad-CAM, although useful for highlighting relevant image regions, lacks the depth required for a comprehensive understanding of decision-making processes. Similarly, SHAP values assist in feature attribution but may not fully capture the complex interactions between features, particularly in images with high variability.
In conclusion, despite advancements in rice disease identification, many existing methods are hindered by their reliance on a single attention mechanism, which fails to capture sufficient disease features. This limitation not only affects the accuracy of detection but also increases computational complexity. This paper proposes an enhanced MobileViT model, named MobileViT-DAP, that improves the accuracy of rice disease identification while reducing model parameters and computational complexity, thus enabling efficient, accurate, and non-destructive identification. Our contributions are summarized as follows: (1) a high-quality rice disease image dataset was developed from field environments to address data scarcity; (2) the MobileViT-DAP model was enhanced by integrating CA and ECA mechanisms, which improved the interaction between spatial and channel features; and (3) multiple visualization techniques, including Grad-CAM, SHAP, and t-SNE, were employed to improve model interpretability and transparency.
2. Materials and Methods
2.1. Image Data Collection
The rice disease image dataset used in this study was primarily compiled by the Anhui Academy of Agricultural Sciences. Image acquisition was conducted using six high-performance digital cameras, including models such as the D810, D800, and D750 (Nikon, Tokyo, Japan), as well as several smartphone devices. The data collection period spanned from 2015 to 2023, covering various regions of China, including Anhui, Guangdong, and Hunan provinces. Images were captured under diverse conditions, including both natural and artificial lighting, and in settings with both complex and simple backgrounds. Efforts were made to ensure the diversity of the images. Each image had a resolution of at least 3000 × 2000 pixels and was saved in JPG format. Disease classification labels were manually verified by plant protection experts from the Anhui Academy of Agricultural Sciences, indicating both the disease type and its corresponding identification number. Based on factors such as frequency of occurrence, the affected area, and the number of available images, six common rice diseases were selected for this study: rice brown spot, rice bacterial leaf blight, rice sheath blight, rice bacterial leaf streak, rice false smut, and rice blast. The typical symptoms of these six diseases are shown in Figure 1.
Among the six rice diseases studied in this paper, rice brown spot, rice bacterial leaf blight, and rice bacterial leaf streak primarily affect the leaves, while rice false smut affects the panicle. These diseases were used to compare disease image features from different plant parts. Rice sheath blight and rice blast can occur on the stems, leaf sheaths, or leaves. Diseases from different plant parts were grouped together to improve the model’s robustness and generalizability. Additionally, regarding lesion shapes, rice bacterial leaf blight and rice bacterial leaf streak primarily exhibit elongated lesions, while the other diseases are characterized by irregular spots and patches. Furthermore, since the disease images used in this study were collected in batches from the field, covering multiple stages of the disease cycle, the lesion characteristics and contextual information became diversified. Factors such as camera parameters and light intensity may also affect the classification model’s performance, presenting challenges to this study.
In this study, we did not include a set of healthy rice leaf images for classification. The primary reason for this decision is that healthy leaves are relatively easy to identify compared to diseased ones and do not require specialized knowledge for accurate classification. Moreover, in datasets that include both healthy and diseased samples, the healthy category typically achieves higher accuracy, indicating that their exclusion has minimal impact on overall model performance [21,22]. Furthermore, in the context of the visualization analysis, the key evaluation criterion is the overlap between the model’s attention regions and the disease spots. Since healthy leaves lack disease lesions, their inclusion could potentially interfere with the interpretability analysis. Therefore, to focus on enhancing disease detection and understanding model behavior in the presence of disease lesions, only diseased leaf images were included in the dataset.
2.2. Dataset Construction and Preprocessing
The dataset was randomly divided into training, validation, and testing sets at a ratio of 7:1:2 [23]. After image annotation, 2331 images captured under various weather and light conditions were selected as the test set, and the remaining 9332 images were used to train the network. The 8-fold cross-validation approach was employed for model training, with the validation set used to fine-tune the model parameters. The specific details of rice disease images are shown in Table 1.
Due to the large size of the original images, preprocessing was performed. The preprocessing steps were applied to input images of size h × w, where h and w denote the height and width of the image, respectively. First, each input image was resized to h′ × w′ while maintaining the original aspect ratio: the smaller of h and w was scaled to 640 pixels, and the larger dimension was scaled to 640 multiplied by the original ratio of the larger dimension to the smaller one. Additionally, data augmentation was applied to the original images to meet the requirements of rice disease identification tasks. Data augmentation enhanced the model’s robustness and accuracy through random cropping, horizontal flipping, and center cropping. In this study, we used the PyTorch 1.12.1 framework and Torchvision for data augmentation.
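As a concrete illustration, the sketch below shows how such a pipeline could be assembled with Torchvision; the crop size (256 pixels) and augmentation parameters are illustrative assumptions rather than the exact values used in this study.

```python
import torchvision.transforms as T
from PIL import Image

def resize_keep_aspect(img: Image.Image, short_side: int = 640) -> Image.Image:
    """Resize so the shorter side becomes `short_side`, preserving the aspect ratio."""
    w, h = img.size
    scale = short_side / min(w, h)
    return img.resize((round(w * scale), round(h * scale)), Image.BILINEAR)

# Illustrative training pipeline: aspect-preserving resize, random crop, horizontal flip.
train_transform = T.Compose([
    T.Lambda(lambda img: resize_keep_aspect(img, 640)),
    T.RandomResizedCrop(256),        # random cropping (crop size assumed)
    T.RandomHorizontalFlip(p=0.5),   # horizontal flipping
    T.ToTensor(),
])

# Illustrative evaluation pipeline: aspect-preserving resize followed by center cropping.
eval_transform = T.Compose([
    T.Lambda(lambda img: resize_keep_aspect(img, 640)),
    T.CenterCrop(256),
    T.ToTensor(),
])
```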
2.3. MobileViT Network Structure
MobileViT is a lightweight hybrid model that combines CNN and ViT to achieve high computational efficiency and strong performance in visual tasks. The model uses convolutional layers to extract local features, capturing fine-grained spatial information such as edges and textures. These features are then passed through Transformer blocks, which capture long-range dependencies and global context, enabling an effective understanding of complex image patterns. The key strength of MobileViT lies in its efficiency. By combining convolutional operations with Transformer layers, the model reduces computational costs compared to traditional ViT models, making it ideal for deployment on resource-constrained devices such as mobile phones and embedded systems. This hybrid design also allows for multi-scale feature extraction, enabling the model to capture both local details and global context, making it effective in tasks such as image classification.
Figure 2 illustrates the structure of MobileViT, which consists primarily of depthwise convolution layers, MV2 layers (inverted residual blocks from MobileNetV2), MobileViT blocks, global pooling, and fully connected layers. The most important and central component is the MobileViT block, whose workflow begins with the application of a 3 × 3 convolution layer for local feature representation. Next, a 1 × 1 convolution layer is applied to adjust the number of channels. Subsequently, the Unfold, Transformer, and Fold structures are applied for global feature modeling. A 1 × 1 convolution layer is then applied to restore the number of channels to its original size. Finally, a shortcut branch concatenates the resulting feature map with the original input feature map along the channel dimension, and a 3 × 3 convolution layer performs feature fusion to produce the output.
A critical step in the workflow of MobileViT involves replacing local modeling in convolution with global modeling using the Transformer. This requires the Unfold and Fold operations to reshape the data into the format necessary for computing self-attention [25,26]. As shown in Figure 3, during the self-attention computation within the MobileViT block, only tokens of the same color are considered for self-attention calculation to reduce computational complexity. The Unfold operation flattens tokens of the same color into a sequence, allowing parallel computation of each sequence using the original self-attention mechanism. The Fold operation then folds the sequences back into the original feature map. Through this process, MobileViT effectively integrates both local and global features, ultimately performing feature fusion via a convolutional layer to produce the output. This design reduces computational demands while maintaining high performance.
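To make the block’s data flow concrete, the following PyTorch sketch outlines a simplified MobileViT block. It uses the built-in nn.TransformerEncoder in place of the original Transformer implementation, and the patch size, depth, and head count are illustrative rather than the settings used in this study.

```python
import torch
import torch.nn as nn

class MobileViTBlock(nn.Module):
    """Simplified sketch of a MobileViT block: local conv -> unfold -> Transformer -> fold -> fusion."""
    def __init__(self, in_ch, d_model, depth=2, patch=(2, 2), n_heads=4):
        super().__init__()
        self.ph, self.pw = patch
        self.local_rep = nn.Sequential(
            nn.Conv2d(in_ch, in_ch, 3, padding=1),   # 3x3 conv: local feature representation
            nn.Conv2d(in_ch, d_model, 1),            # 1x1 conv: adjust channel dimension
        )
        layer = nn.TransformerEncoderLayer(d_model, n_heads,
                                           dim_feedforward=2 * d_model, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, depth)   # global feature modeling
        self.proj = nn.Conv2d(d_model, in_ch, 1)                 # 1x1 conv: restore channel count
        self.fusion = nn.Conv2d(2 * in_ch, in_ch, 3, padding=1)  # 3x3 conv: fuse with shortcut

    def forward(self, x):
        shortcut = x
        y = self.local_rep(x)
        B, d, H, W = y.shape                     # H and W must be divisible by the patch size
        nh, nw = H // self.ph, W // self.pw
        # Unfold: tokens at the same position within each patch form one sequence.
        y = y.reshape(B, d, nh, self.ph, nw, self.pw)
        y = y.permute(0, 3, 5, 2, 4, 1).reshape(B * self.ph * self.pw, nh * nw, d)
        y = self.transformer(y)
        # Fold: reshape the sequences back into the original feature-map layout.
        y = y.reshape(B, self.ph, self.pw, nh, nw, d)
        y = y.permute(0, 5, 3, 1, 4, 2).reshape(B, d, H, W)
        y = self.proj(y)
        y = torch.cat([shortcut, y], dim=1)      # shortcut branch: concatenate along channels
        return self.fusion(y)
```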
2.4. Attention Module
2.4.1. Efficient Channel Attention
The ECA mechanism is a lightweight, computationally efficient channel attention module designed to enhance the representational power of CNNs. Unlike traditional methods, ECA uses a 1D convolution to capture local cross-channel dependencies without dimensionality reduction, enabling the model to adaptively focus on important channels. This approach improves performance with minimal computational cost, making it well suited to real-time applications and resource-constrained environments. While ECA excels at enhancing channel-wise features, its ability to capture global dependencies may be limited, reducing its effectiveness in tasks requiring extensive global context. Overall, ECA strikes a balance between efficiency and performance, offering a robust solution for various visual tasks.
As shown in Figure 4, the ECA mechanism replaces the fully connected layers of the original SENet module with a 1D convolutional layer that learns directly from the features obtained after global average pooling. An adaptive method determines the convolution kernel size from the channel dimension, and the 1D convolution is then applied. Weights are obtained through a sigmoid function and multiplied back onto the original feature map to produce the final output. The kernel size of this 1D convolution determines the extent of cross-channel interaction and the number of channels involved in each weight calculation, making it a critical factor. The ECA mechanism efficiently captures channel-wise relationships in convolutional feature maps, enhancing feature extraction and representation learning.
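The following PyTorch sketch illustrates this mechanism; the adaptive kernel-size heuristic follows the formulation commonly used in ECA-Net implementations, and the hyperparameters are illustrative.

```python
import math
import torch
import torch.nn as nn

class ECA(nn.Module):
    """Sketch of Efficient Channel Attention with an adaptively sized 1D convolution."""
    def __init__(self, channels, gamma=2, b=1):
        super().__init__()
        # Adaptive kernel size: grows with the log of the channel dimension, forced to be odd.
        t = int(abs((math.log2(channels) + b) / gamma))
        k = t if t % 2 else t + 1
        self.pool = nn.AdaptiveAvgPool2d(1)                              # global average pooling
        self.conv = nn.Conv1d(1, 1, kernel_size=k, padding=k // 2, bias=False)
        self.sigmoid = nn.Sigmoid()

    def forward(self, x):
        # x: (B, C, H, W)
        y = self.pool(x)                                  # (B, C, 1, 1)
        y = y.squeeze(-1).transpose(-1, -2)               # (B, 1, C): channels form the 1D sequence
        y = self.conv(y)                                  # local cross-channel interaction
        y = self.sigmoid(y).transpose(-1, -2).unsqueeze(-1)  # (B, C, 1, 1) channel weights
        return x * y                                      # re-weight the original feature map
```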
2.4.2. Coordinate Attention
The CA mechanism is designed to capture relationships between different positions in an image, considering that each position contains unique and important coordinate information. It captures attention across the image’s width and height, encoding precise positional information. CA offers advantages such as efficient spatial feature integration, low computational cost, and improved accuracy in tasks like object detection and segmentation. However, its performance may degrade when global spatial context is required across large feature maps, and it may not capture long-range dependencies as effectively as self-attention mechanisms. Despite these limitations, CA remains a powerful tool for enhancing spatial feature representation in convolutional networks.
As shown in Figure 5, the CA mechanism performs global average pooling separately along the width and height of the input feature map, then concatenates the two pooled descriptors and reduces their dimensionality with a shared convolutional layer. After batch normalization and a nonlinear activation, the features are split back into the two directions, and a 1 × 1 convolution followed by a sigmoid function generates two attention weight matrices, which are multiplied with the original feature map to produce the output. The CA mechanism introduces minimal computational overhead, preserving the model’s lightweight nature. By incorporating positional information, CA enhances spatial selectivity, captures long-range dependencies along each spatial direction, and improves accuracy and generalization, thus making a significant contribution to feature extraction and representation learning in image analysis.
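A minimal PyTorch sketch of this mechanism is given below; the reduction ratio and activation function are illustrative choices rather than the exact settings used in this study.

```python
import torch
import torch.nn as nn

class CoordinateAttention(nn.Module):
    """Sketch of Coordinate Attention: separate pooling along height and width."""
    def __init__(self, channels, reduction=32):
        super().__init__()
        mid = max(8, channels // reduction)
        self.pool_h = nn.AdaptiveAvgPool2d((None, 1))   # pool over width  -> (B, C, H, 1)
        self.pool_w = nn.AdaptiveAvgPool2d((1, None))   # pool over height -> (B, C, 1, W)
        self.conv1 = nn.Conv2d(channels, mid, 1)        # shared 1x1 conv reduces channels
        self.bn = nn.BatchNorm2d(mid)
        self.act = nn.Hardswish()
        self.conv_h = nn.Conv2d(mid, channels, 1)
        self.conv_w = nn.Conv2d(mid, channels, 1)

    def forward(self, x):
        b, c, h, w = x.shape
        xh = self.pool_h(x)                              # (B, C, H, 1)
        xw = self.pool_w(x).permute(0, 1, 3, 2)          # (B, C, W, 1)
        y = torch.cat([xh, xw], dim=2)                   # concatenate along the spatial axis
        y = self.act(self.bn(self.conv1(y)))
        yh, yw = torch.split(y, [h, w], dim=2)           # split back into the two directions
        yw = yw.permute(0, 1, 3, 2)                      # (B, mid, 1, W)
        ah = torch.sigmoid(self.conv_h(yh))              # attention weights along height
        aw = torch.sigmoid(self.conv_w(yw))              # attention weights along width
        return x * ah * aw                               # apply both weight maps to the input
```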
2.5. Transfer Learning of Rice Disease Identification Using MobileViT-DAP
Transfer learning is a machine learning technique that transfers existing knowledge from one domain to another to improve learning performance in the target domain [28]. In image identification, transfer learning can leverage vast amounts of data and pre-trained models to learn new image identification tasks without the need to train complex neural networks from scratch. This approach saves time and computational resources while improving the model’s generalization capabilities and accuracy. In this paper, transfer learning is employed for model training.
Single attention mechanisms may not fully capture the complexities of classification tasks, especially those involving intricate patterns or fine-grained feature interactions. In this context, CA excels at spatial encoding by effectively capturing positional information but struggles to model inter-channel dependencies. On the other hand, ECA focuses on efficient cross-channel refinement, optimizing channel-wise attention, but lacks strong spatial localization capabilities. To address these limitations, we enhance MobileViT by incorporating a dual-attention mechanism that combines the CA and ECA mechanisms. This modification aims to improve the model’s ability to capture both fine-grained spatial features and inter-channel dependencies. MobileViT-XXS, which uses the smallest set of pre-trained weights in the MobileViT family, serves as the foundation for this enhancement, resulting in a more efficient and effective model suitable for complex classification tasks.
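As an illustration of this transfer learning setup, the sketch below loads ImageNet pre-trained MobileViT-XXS weights via the timm library and adapts the classifier head to the six rice disease classes. The weight source and parameter-name prefixes are assumptions, since the paper does not specify them, and the architectural modifications (MbECA, CA, PoolFormer) would be attached separately.

```python
import timm

# Load ImageNet pre-trained MobileViT-XXS weights as the starting point (transfer learning)
# and replace the classifier head for the six rice disease classes.
# Note: the model name and availability depend on the installed timm version.
model = timm.create_model("mobilevit_xxs", pretrained=True, num_classes=6)

# Optionally freeze the earliest layers and fine-tune only the later stages and the head.
# The "stem" prefix is an assumption about timm's parameter naming.
for name, param in model.named_parameters():
    if name.startswith("stem"):
        param.requires_grad = False
```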
Figure 6c presents the overall architecture of MobileViT-DAP. The main modifications include replacing the MV2 module with the MbECA module, substituting the last two MobileViT blocks with the PoolFormer module, and adding a CA module after the final 1 × 1 convolutional layer. As shown in Figure 6a, the MbECA module is implemented by adding an ECA module to the MV2 module [29]. Specifically, its input feature map first passes through a 1 × 1 convolutional layer, which increases the number of channels, followed by an optional downsampling depthwise separable convolution operation consisting of a 3 × 3 channel-wise convolutional layer and a 1 × 1 point-wise convolutional layer. Subsequently, the number of channels is reduced through another 1 × 1 convolutional layer, and the ECA module is incorporated. Finally, a residual connection is applied. As illustrated in Figure 6b, the PoolFormer architecture is simple yet achieves highly competitive performance, demonstrating that structural simplification can be implemented without compromising model performance [30]. The added CA module at the end is designed to capture additional spatial positional feature information. Overall, the model enhances spatial feature perception through CA, addressing the limitations of spatial information in MobileViT. It strengthens channel feature relationships through ECA, improving the ability to capture finer details, and further enhances the model’s lightweight and efficient characteristics with PoolFormer.
2.6. Evaluation Standard
When evaluating model performance, we selected precision, recall, specificity, and accuracy as evaluation metrics. Precision represents the proportion of correct positive predictions out of all positive predictions. Recall represents the proportion of correct positive predictions out of all actual positive samples. Specificity represents the probability of correctly predicting negative samples out of all negative samples, assessing the model’s ability to identify negative class samples. Accuracy represents the proportion of correct predictions out of all predicted samples.
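For clarity, the four metrics can be written in their standard forms:

```latex
\[
\text{Precision} = \frac{TP}{TP + FP}, \qquad
\text{Recall} = \frac{TP}{TP + FN}, \qquad
\text{Specificity} = \frac{TN}{TN + FP}, \qquad
\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}.
\]
```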
In these formulas, TP (true positives) refers to the number of actual positive samples predicted as positive by the model, TN (true negatives) refers to the number of actual negative samples predicted as negative, FP (false positives) refers to the number of samples predicted as positive that are actually negative, and FN (false negatives) refers to the number of samples predicted as negative that are actually positive.
2.7. Interpretability Methods
Visualization techniques, such as Grad-CAM, SHAP, and t-SNE, are crucial for improving the interpretability of machine learning models by offering insights into how models make decisions. These techniques map the model’s internal representations and decision-making process to visual cues, helping identify the features or regions in the input data that contribute most to predictions.
Grad-CAM is a powerful technique for visualizing which regions of an image contribute most to a model’s decision by generating class-specific heatmaps. It calculates the gradients of the target class with respect to the final convolutional layer and uses these gradients to weight the feature maps, highlighting discriminative areas in the input image [31]. Grad-CAM is non-invasive and model-agnostic, making it applicable to a wide range of deep learning models, including those used for image classification. It enhances the interpretability of complex models by providing spatial attention maps, facilitating a better understanding of the decision-making process. However, its reliance on the final convolutional layer may lead to imprecise localization, and the generated heatmaps can be noisy, particularly in fine-grained tasks.
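A minimal sketch of the Grad-CAM computation is shown below; it is not the exact visualization pipeline used in this study, and `target_layer` would typically be the final convolutional layer of the network.

```python
import torch
import torch.nn.functional as F

def grad_cam(model, image, target_layer, class_idx=None):
    """Minimal Grad-CAM: weight the target layer's feature maps by their pooled gradients."""
    feats, grads = {}, {}
    h1 = target_layer.register_forward_hook(lambda m, i, o: feats.update(a=o))
    h2 = target_layer.register_full_backward_hook(lambda m, gi, go: grads.update(a=go[0]))

    logits = model(image.unsqueeze(0))               # image: (C, H, W) tensor
    if class_idx is None:
        class_idx = logits.argmax(dim=1).item()
    model.zero_grad()
    logits[0, class_idx].backward()                  # gradients of the target class score
    h1.remove()
    h2.remove()

    weights = grads["a"].mean(dim=(2, 3), keepdim=True)       # global-average-pooled gradients
    cam = F.relu((weights * feats["a"]).sum(dim=1))           # weighted sum of feature maps
    cam = F.interpolate(cam.unsqueeze(1), size=image.shape[1:],
                        mode="bilinear", align_corners=False)
    cam = (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)  # normalize heatmap to [0, 1]
    return cam.squeeze(), class_idx
```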
SHAP is a game theory-based method that provides model-agnostic explanations by assigning each feature a contribution value based on its impact on the model’s prediction. It ensures fairness and consistency by calculating Shapley values, which sum to the model’s output, and is particularly useful for local interpretability, providing insights into individual predictions [32]. SHAP’s advantages include its ability to explain complex black-box models and its theoretical foundation, making it highly reliable for feature importance analysis. However, its computational complexity can be a limitation, particularly for models with many features, and its effectiveness depends on the quality of the data and the model.
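The snippet below sketches how SHAP attributions could be computed for a trained image classifier with SHAP’s GradientExplainer. The variable names (model, background_images, test_images), tensor shapes, and axis handling are illustrative assumptions and may need adjustment depending on the SHAP version.

```python
import numpy as np
import shap

# Assumed inputs: a trained `model` and two tensors of shape (N, 3, 256, 256),
# `background_images` drawn from the training set and `test_images` to be explained.
explainer = shap.GradientExplainer(model, background_images)   # background approximates the data distribution
shap_values = explainer.shap_values(test_images)                # one attribution map per class

# Rearrange to channel-last before plotting (axis order may differ between SHAP versions).
shap_numpy = [np.transpose(sv, (0, 2, 3, 1)) for sv in shap_values]
test_numpy = np.transpose(test_images.cpu().numpy(), (0, 2, 3, 1))
shap.image_plot(shap_numpy, test_numpy)
```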
t-SNE is a nonlinear dimensionality reduction technique used to visualize high-dimensional data in lower dimensions. It minimizes the divergence between probability distributions representing pairwise similarities in the high-dimensional and low-dimensional spaces, preserving local structure. t-SNE is particularly effective in revealing clusters and complex relationships, making it valuable for exploratory data analysis and model evaluation. However, it has limitations, including high computational complexity, sensitivity to hyperparameters, and poor preservation of global structure.
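A typical workflow, sketched below, extracts penultimate-layer embeddings and projects them to two dimensions with scikit-learn’s TSNE; the feature extractor, data loader, and perplexity value are illustrative assumptions rather than the settings used in this study.

```python
import matplotlib.pyplot as plt
import torch
from sklearn.manifold import TSNE

# Assumed inputs: `feature_extractor` returns the penultimate-layer embedding for a batch
# of images, and `loader` iterates over the test set as (images, targets) pairs.
features, labels = [], []
with torch.no_grad():
    for images, targets in loader:
        features.append(feature_extractor(images).flatten(1).cpu())
        labels.append(targets)
features = torch.cat(features).numpy()
labels = torch.cat(labels).numpy()

# Project the high-dimensional embeddings to 2D and color points by disease class.
embedded = TSNE(n_components=2, perplexity=30, init="pca",
                random_state=42).fit_transform(features)
plt.scatter(embedded[:, 0], embedded[:, 1], c=labels, cmap="tab10", s=5)
plt.title("t-SNE of learned feature embeddings")
plt.show()
```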
Each technique has its limitations. No single method can fully explain a model’s decision-making process. Grad-CAM may lack precision in deeper layers, SHAP can be computationally expensive for large datasets, and t-SNE may fail to preserve global structures effectively. These limitations underscore the need to use a combination of these techniques, allowing for a more comprehensive and complementary understanding of model behavior.
4. Discussion
Accurate identification of rice diseases is essential for effective disease management and yield protection. However, rice disease recognition in real-world field conditions presents significant challenges due to its inherent complexity. This complexity arises from several factors. First, inter-class symptom similarity makes it difficult to distinguish between different diseases, as many pathogens cause visually overlapping lesions or discolorations. Second, within the same disease category, intra-class variability exists due to the progression of symptoms across different growth stages, leading to variations in lesion size, color, and texture. Additionally, the diversity of infection sites on rice plants—including leaves, stems, and panicles—further complicates accurate classification. Moreover, the complexity of field environments exacerbates these challenges, as factors such as varying background conditions, fluctuating light intensity, and occlusions caused by overlapping leaves or other plants introduce noise and inconsistencies in image data [36,37]. These issues collectively highlight the need for robust, adaptable models capable of handling the diverse visual characteristics and environmental variability inherent in rice disease detection tasks.
Compared to MobileViT-XXS, MobileViT-DAP integrates CA, ECA, and PoolFormer modules (Figure 6). These enhancements enable efficient deployment on resource-constrained edge devices, providing a high-performance real-time solution for rice disease identification. Compared to studies [38,39,40], which introduced a single attention module to enhance model performance, this study incorporates a dual-attention mechanism, resulting in greater improvements in classification performance. While CA enhances spatial feature encoding by capturing positional dependencies, it has limited capability in modeling inter-channel relationships. Conversely, ECA efficiently refines channel-wise dependencies but lacks spatial localization ability. The combination of both mechanisms enables the model to capture fine-grained spatial details and robust channel interactions simultaneously, improving feature representation and classification performance. Additionally, the integration of the PoolFormer module further reduces the model’s complexity without compromising performance. Model complexity and computational efficiency were also incorporated as key evaluation metrics to demonstrate the advantages of the proposed method.
The MobileViT-DAP model proposed in this paper reduces the Params by 21% compared to the original MobileViT-XXS while achieving improvements in classification performance. As shown in Table 2 and Figure 8, compared to traditional lightweight models, such as EfficientNet and MobileNet, MobileViT-DAP demonstrates a unique trade-off between model size, computational efficiency, and classification performance. For example, MobileNetV3-Small achieves lower FLOPs than MobileViT-DAP (0.06 G vs. 0.23 G) but falls short in both model size and classification performance. Other lightweight models achieve higher classification performance at the cost of increased model size and computational complexity, yet they still lag behind MobileViT-DAP’s performance gains. Moreover, MobileViT-DAP exhibits appropriate inference times (5.15 ms) and a low memory footprint (3.03 MB), underscoring its suitability for real-world deployments in resource-constrained agricultural environments.
Figure 13 provides a more intuitive comparison of the classification performance of different models on the rice disease image dataset constructed in this study. The results indicate that MobileViT-DAP achieves the highest accuracy, precision, recall, and specificity, with values approaching 1.0, demonstrating its excellent classification performance and robustness. Furthermore, to mitigate overfitting risks associated with a more intricate design, we have implemented L2 regularization strategies and performed extensive cross-validation. These measures ensure that the performance gains are achieved without compromising the model’s generalization capabilities. Overall, despite adopting a lightweight architecture, MobileViT-DAP maintains outstanding performance, highlighting its efficiency and suitability for real-world applications.
Deep learning models often face interpretability challenges due to their black-box nature. In many earlier studies, the interpretability of models was not extensively explored due to the limitations of visualization techniques available at that time [41,42]. Consequently, it was challenging to ascertain whether the models’ decisions were based solely on the intended target features or were inadvertently influenced by extraneous factors such as background information or image parameters. This lack of transparency can undermine trust in the model’s predictions. For instance, as shown in the second column of Figure 9, although MobileViT achieves correct classification, the regions it focuses on have low overlap with the actual lesion areas, raising concerns about the model’s reliability. To address this, we employed Grad-CAM and SHAP for model explanation. Grad-CAM highlights key image regions influencing predictions but is limited by its reliance on convolutional features, resulting in low-resolution heatmaps and reduced effectiveness in models with non-convolutional architectures [43]. In contrast, SHAP provides granular feature-level attributions, yet its computational cost is high for complex models, and it assumes feature independence, which may lead to biased interpretations in datasets with correlated features [18]. It is important to note that while the combined use of Grad-CAM and SHAP significantly enhances the interpretability of our model, the increased architectural complexity of MobileViT-DAP introduces inherent challenges. These challenges may limit the overall transparency of the decision-making process compared to simpler models, representing a trade-off between achieving high classification performance and maintaining model interpretability. In this study, we combined Grad-CAM’s spatial visualization with SHAP’s feature importance analysis for a comprehensive understanding of model decisions. Additionally, t-SNE was applied to visualize high-dimensional feature distributions, offering further insights into the model’s learned representations.
Although MobileViT-DAP demonstrated excellent performance in the experiments, it still has certain limitations. First, the experiments focused on data within a limited scope, and the model’s generalizability requires further validation and optimization across more diverse application scenarios. While this study constructed a large, high-quality rice disease image dataset, establishing similarly scaled datasets for other applications may not be feasible. Therefore, applying the model to small-sample classification tasks remains an area for further exploration. Additionally, the efficient attention mechanism enables the model to focus on lesion regions to improve its reliability. However, for samples with similar lesion features, it may increase the likelihood of misclassification. As shown in Figure 12, late-stage rice bacterial leaf streak was misclassified as early-stage rice bacterial leaf blight, highlighting the challenge of distinguishing between these similar symptoms. Moreover, the model’s performance in handling unseen diseases or environmental variations has not been thoroughly evaluated, which could affect its robustness in real-world conditions. Another relevant factor that may contribute to misclassification is nutrient deficiency, particularly potassium deficiency, which can cause leaf discoloration and necrotic margins that closely resemble bacterial leaf blight and may lead to false positives in classification. Future research should focus on enhancing the model’s ability to capture fine-grained features and extending its attention mechanism to incorporate contextual information, thereby improving its discriminatory power and reducing the likelihood of misclassification in complex classification tasks.
5. Conclusions
In this study, we constructed a large, high-quality rice disease image dataset to support the development and evaluation of deep learning models for plant disease classification. Based on the MobileViT architecture, we proposed an improved model named MobileViT-DAP by integrating CA, ECA, and PoolFormer blocks. The experimental results demonstrated that, compared to various lightweight models, the improved model achieved significant enhancements in terms of model complexity, computational efficiency, and classification performance. Specifically, compared to the baseline model, MobileViT-DAP achieved an impressive reduction of 21% in Params, while maintaining high performance with an accuracy of 99.61%, a precision of 99.64%, a recall of 99.59%, and a specificity of 99.92%, effectively balancing lightweight design and high accuracy. Furthermore, the visualization analysis revealed that the model’s attention regions during the decision-making process highly overlapped with the actual lesion areas, indicating that the model primarily relies on disease-specific features for classification. This not only enhances the trustworthiness of the model’s predictions but also further validates its superior performance. Overall, this work offers a novel perspective for optimizing plant disease recognition tasks. In future work, we aim to expand the model’s application scenarios and further optimize its performance across diverse agricultural environments.