1. Introduction
Rice, one of the most important staple crops worldwide, is a primary food source for much of the Asian population. However, during its growth cycle, rice is often affected by diseases such as rice blast, rice sheath blight, and rice bacterial leaf blight, which significantly reduce both yield and quality. Traditional plant disease diagnosis methods mainly rely on manual observation and machine learning models that require manual feature extraction [1,2,3]. Manual observation is not only time-consuming and labor-intensive but also limited by the observer’s technical expertise and experience, which can lead to misjudgments and omissions. Additionally, manual feature extraction involves complex preprocessing and segmentation steps tailored to different disease types and severity levels, making it difficult to capture the underlying patterns in images [4]. Therefore, rapid and accurate identification of rice diseases is crucial for implementing effective prevention and control measures, ensuring food security, and promoting sustainable agricultural development.
With the rapid advancement of computer vision technology, deep learning-based rice disease identification methods have emerged as promising solutions [5]. Convolutional neural networks (CNNs), deep multi-layer neural networks, offer strong expressiveness and generalization capabilities, enabling the automatic extraction of high-level semantic features from large datasets [6]. By collecting image data from rice leaves or whole plants and using deep learning models to extract features, accurate rice disease identification can be achieved. However, the large parameter sizes and computational costs of these models present challenges for their deployment in resource-constrained agricultural environments. To address these limitations, lightweight CNN models, such as MobileNet and EfficientNet, have been proposed. These models are attractive because of their proven ability to balance high accuracy with reduced computational requirements, making them well suited to mobile and embedded applications in agriculture. For instance, Elakya et al. [7] achieved 98.73% accuracy in classifying rice diseases using an enhanced MobileNetV2 model while maintaining a lightweight architecture. Asvitha et al. [8] developed a smartphone-based rice disease and pest identification system using the MobileNetV3 framework, achieving an accuracy of 93.75% and helping to reduce crop losses. Wang et al. [9] proposed a lightweight EfficientNet ensemble method, achieving an accuracy of 96.10% in classifying five types of rice tissue images while improving computational efficiency compared to traditional CNN models.
Despite these advancements, lightweight models still encounter challenges in complex field environments and with overlapping disease symptoms. Attention mechanisms have significantly enhanced the performance of deep learning models in plant disease image recognition by improving feature representation and focusing on disease-relevant regions. For instance, spatial attention mechanisms prioritize key areas within an image, improving lesion region localization and enhancing classification accuracy [10]. Channel attention mechanisms, such as Squeeze-and-Excitation (SE) blocks, refine inter-channel features by assigning higher weights to informative channels, thus improving performance in complex classification tasks [11]. Coordinate attention (CA), which combines spatial and channel information along coordinate axes, has been effectively utilized in lightweight CNN architectures to improve accuracy while maintaining computational efficiency [12]. Additionally, Transformer-based models, such as the Vision Transformer (ViT) and Swin Transformer, have emerged as powerful alternatives, leveraging self-attention mechanisms to model long-range dependencies and enhance robustness in complex scenarios. Rachman et al. [13] demonstrated the effectiveness of ViT in classifying rice diseases under diverse field conditions, while Zhang et al. [14] utilized the Swin Transformer for high-precision disease detection, outperforming CNN models. To further improve accuracy and efficiency in plant disease recognition, recent studies have integrated CNNs with Transformer architectures. Ding et al. [15] developed a model integrating deep transfer learning, achieving high accuracy in apple leaf disease classification with efficient computational requirements, while Thai et al. [16] combined the CNN’s local feature extraction with the Transformer’s global representation learning, demonstrating strong performance on multiple crop datasets while ensuring real-time processing. These approaches highlight the potential of hybrid models for efficient and accurate disease detection.
Although deep learning models achieve high accuracy in plant disease classification, they are often criticized for their “black-box” nature, which limits interpretability, raises trust concerns, and complicates error diagnosis in critical applications such as agriculture. This lack of transparency prevents users from understanding how models make decisions, particularly in cases of incorrect predictions or overlapping disease classes. To address these challenges, various visualization techniques have been developed to offer insights into model behavior. Karim et al. [17] developed a grape leaf disease classification model based on MobileNetV3-Large and used the Gradient-weighted Class Activation Mapping (Grad-CAM) technique to identify and highlight the image regions involved in the model’s decision-making process. Maeda-Gutiérrez et al. [18] developed multiple deep learning models for taro disease image classification and used SHapley Additive exPlanations (SHAP) to explain the models, revealing distinct classification strategies and identifying key features, such as leaf veins and color patterns, that the models rely on for disease prediction. Hu et al. [19] introduced a decoupled feature learning framework, leveraging causal inference techniques to build a pest image classification model, and used t-Distributed Stochastic Neighbor Embedding (t-SNE) to visualize the model’s feature extraction performance, demonstrating improved inter-class separation and reduced intra-class variance. These techniques bridge the gap between model decisions and human understanding, enhancing trust and enabling more reliable deployment in real-world applications.
However, despite the proven effectiveness of attention mechanisms in enhancing deep learning models, certain limitations constrain their performance in complex environments. While mechanisms such as CA and Efficient Channel Attention (ECA) effectively improve feature selection, a single attention mechanism may not be sufficient for tasks with intricate backgrounds, such as rice disease classification in field conditions. The diversity of leaf shapes, disease spots, and complex field backgrounds makes it difficult for these mechanisms to capture all relevant spatial dependencies [20]. Additionally, while Grad-CAM and SHAP provide valuable insights into model decisions, they also have limitations. Grad-CAM, although useful for highlighting relevant image regions, lacks the depth required for a comprehensive understanding of decision-making processes. Similarly, SHAP values assist in feature attribution but may not fully capture the complex interactions between features, particularly in images with high variability.
In conclusion, despite advancements in rice disease identification, many existing methods are hindered by their reliance on a single attention mechanism, which fails to capture sufficient disease features. This limitation not only affects the accuracy of detection but also increases computational complexity. This paper proposes an enhanced MobileViT model, named MobileViT-DAP, that improves the accuracy of rice disease identification while reducing model parameters and computational complexity, thus enabling efficient, accurate, and non-destructive identification. Our contributions are summarized as follows: (1) a high-quality rice disease image dataset was developed from field environments to address data scarcity; (2) the MobileViT-DAP model was enhanced by integrating CA and ECA mechanisms, which improved the interaction between spatial and channel features; and (3) multiple visualization techniques, including Grad-CAM, SHAP, and t-SNE, were employed to improve model interpretability and transparency.
2. Materials and Methods
2.1. Image Data Collection
The rice disease image dataset used in this study was primarily compiled by the Anhui Academy of Agricultural Sciences. Image acquisition was conducted using six high-performance digital cameras, including models such as the D810, D800, and D750 (Nikon, Tokyo, Japan), as well as several smartphone devices. The data collection period spanned from 2015 to 2023, covering various regions of China, including Anhui, Guangdong, and Hunan provinces. Images were captured under diverse conditions, including both natural and artificial lighting, and in settings with both complex and simple backgrounds. Efforts were made to ensure the diversity of the images. Each image had a resolution of at least 3000 × 2000 pixels and was saved in JPG format. Disease classification labels were manually verified by plant protection experts from the Anhui Academy of Agricultural Sciences, indicating both the disease type and its corresponding identification number. Based on factors such as frequency of occurrence, the affected area, and the number of available images, six common rice diseases were selected for this study: rice brown spot, rice bacterial leaf blight, rice sheath blight, rice bacterial leaf streak, rice false smut, and rice blast. The typical symptoms of these six diseases are shown in Figure 1.
Among the six rice diseases studied in this paper, rice brown spot, rice bacterial leaf blight, and rice bacterial leaf streak primarily affect the leaves, while rice false smut affects the panicle. These diseases were used to compare disease image features from different plant parts. Rice sheath blight and rice blast can occur on the stems, leaf sheaths, or leaves. Diseases from different plant parts were grouped together to improve the model’s robustness and generalizability. Additionally, regarding lesion shapes, rice bacterial leaf blight and rice bacterial leaf streak primarily exhibit elongated lesions, while the other diseases are characterized by irregular spots and patches. Furthermore, since the disease images used in this study were collected in batches from the field, covering multiple stages of the disease cycle, the lesion characteristics and contextual information became diversified. Factors such as camera parameters and light intensity may also affect the classification model’s performance, presenting challenges to this study.
In this study, we did not include a set of healthy rice leaf images for classification. The primary reason for this decision is that healthy leaves are relatively easy to identify compared to diseased ones and do not require specialized knowledge for accurate classification. Moreover, in datasets that include both healthy and diseased samples, the healthy category typically achieves higher accuracy, indicating that their exclusion has minimal impact on overall model performance [21,22]. Furthermore, in the context of the visualization analysis, the key evaluation criterion is the overlap between the model’s attention regions and the disease spots. Since healthy leaves lack disease lesions, their inclusion could potentially interfere with the interpretability analysis. Therefore, to focus on enhancing disease detection and understanding model behavior in the presence of disease lesions, only diseased leaf images were included in the dataset.
2.2. Dataset Construction and Preprocessing
The dataset was randomly divided into training, validation, and testing sets at a ratio of 7:1:2 [23]. After image annotation, 2331 images captured under various weather and light conditions were selected as the test set, and the remaining 9332 images were used to train the network. The 8-fold cross-validation approach was employed for model training, with the validation set used to fine-tune the model parameters. The specific details of rice disease images are shown in Table 1.
Due to the large size of the original images, preprocessing was performed. The preprocessing steps were applied to input images of size h × w, where h and w denote the height and width of the image, respectively. First, each input image was resized to h′ × w′ while maintaining the original aspect ratio: the smaller of h and w was scaled to 640 pixels, and the larger dimension was scaled to 640 multiplied by the original ratio of the larger dimension to the smaller one. Additionally, data augmentation was applied to the original images to meet the requirements of rice disease identification tasks. Data augmentation enhanced the model’s robustness and accuracy through random cropping, horizontal flipping, and center cropping. In this study, we used the PyTorch 1.12.1 framework and Torchvision for data augmentation.
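As a concrete illustration, the sketch below shows how such a pipeline could be assembled with Torchvision; the crop size (256 pixels) and augmentation parameters are illustrative assumptions rather than the exact values used in this study.

```python
import torchvision.transforms as T
from PIL import Image

def resize_keep_aspect(img: Image.Image, short_side: int = 640) -> Image.Image:
    """Resize so the shorter side becomes `short_side`, preserving the aspect ratio."""
    w, h = img.size
    scale = short_side / min(w, h)
    return img.resize((round(w * scale), round(h * scale)), Image.BILINEAR)

# Illustrative training pipeline: aspect-preserving resize, random crop, horizontal flip.
train_transform = T.Compose([
    T.Lambda(lambda img: resize_keep_aspect(img, 640)),
    T.RandomResizedCrop(256),        # random cropping (crop size assumed)
    T.RandomHorizontalFlip(p=0.5),   # horizontal flipping
    T.ToTensor(),
])

# Illustrative evaluation pipeline: aspect-preserving resize followed by center cropping.
eval_transform = T.Compose([
    T.Lambda(lambda img: resize_keep_aspect(img, 640)),
    T.CenterCrop(256),
    T.ToTensor(),
])
```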
2.3. MobileViT Network Structure
MobileViT is a lightweight hybrid model that combines CNN and ViT to achieve high computational efficiency and strong performance in visual tasks. The model uses convolutional layers to extract local features, capturing fine-grained spatial information such as edges and textures. These features are then passed through Transformer blocks, which capture long-range dependencies and global context, enabling an effective understanding of complex image patterns. The key strength of MobileViT lies in its efficiency. By combining convolutional operations with Transformer layers, the model reduces computational costs compared to traditional ViT models, making it ideal for deployment on resource-constrained devices such as mobile phones and embedded systems. This hybrid design also allows for multi-scale feature extraction, enabling the model to capture both local details and global context, making it effective in tasks such as image classification.
Figure 2 illustrates the structure of MobileViT, which consists primarily of depthwise convolution layers, MV2 layers (inverted residual blocks from MobileNetV2), MobileViT blocks, global pooling, and fully connected layers. The most important and central component is the MobileViT block, whose workflow begins with the application of a 3 × 3 convolution layer for local feature representation. Next, a 1 × 1 convolution layer is applied to adjust the number of channels. Subsequently, the Unfold, Transformer, and Fold structures are applied for global feature modeling. A 1 × 1 convolution layer is then applied to restore the number of channels to its original size. Finally, a shortcut branch concatenates the resulting feature map with the original input feature map along the channel dimension, and a 3 × 3 convolution layer performs feature fusion to produce the output.
A critical step in the workflow of MobileViT involves replacing local modeling in convolution with global modeling using the Transformer. This requires the Unfold and Fold operations to reshape the data into the format necessary for computing self-attention [25,26]. As shown in Figure 3, during the self-attention computation within the MobileViT block, only tokens of the same color are considered for self-attention calculation to reduce computational complexity. The Unfold operation flattens tokens of the same color into a sequence, allowing parallel computation of each sequence using the original self-attention mechanism. The Fold operation then folds the sequences back into the original feature map. Through this process, MobileViT effectively integrates both local and global features, ultimately performing feature fusion via a convolutional layer to produce the output. This design reduces computational demands while maintaining high performance.
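To make the block’s data flow concrete, the following PyTorch sketch outlines a simplified MobileViT block. It uses the built-in nn.TransformerEncoder in place of the original Transformer implementation, and the patch size, depth, and head count are illustrative rather than the settings used in this study.

```python
import torch
import torch.nn as nn

class MobileViTBlock(nn.Module):
    """Simplified sketch of a MobileViT block: local conv -> unfold -> Transformer -> fold -> fusion."""
    def __init__(self, in_ch, d_model, depth=2, patch=(2, 2), n_heads=4):
        super().__init__()
        self.ph, self.pw = patch
        self.local_rep = nn.Sequential(
            nn.Conv2d(in_ch, in_ch, 3, padding=1),   # 3x3 conv: local feature representation
            nn.Conv2d(in_ch, d_model, 1),            # 1x1 conv: adjust channel dimension
        )
        layer = nn.TransformerEncoderLayer(d_model, n_heads,
                                           dim_feedforward=2 * d_model, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, depth)   # global feature modeling
        self.proj = nn.Conv2d(d_model, in_ch, 1)                 # 1x1 conv: restore channel count
        self.fusion = nn.Conv2d(2 * in_ch, in_ch, 3, padding=1)  # 3x3 conv: fuse with shortcut

    def forward(self, x):
        shortcut = x
        y = self.local_rep(x)
        B, d, H, W = y.shape                     # H and W must be divisible by the patch size
        nh, nw = H // self.ph, W // self.pw
        # Unfold: tokens at the same position within each patch form one sequence.
        y = y.reshape(B, d, nh, self.ph, nw, self.pw)
        y = y.permute(0, 3, 5, 2, 4, 1).reshape(B * self.ph * self.pw, nh * nw, d)
        y = self.transformer(y)
        # Fold: reshape the sequences back into the original feature-map layout.
        y = y.reshape(B, self.ph, self.pw, nh, nw, d)
        y = y.permute(0, 5, 3, 1, 4, 2).reshape(B, d, H, W)
        y = self.proj(y)
        y = torch.cat([shortcut, y], dim=1)      # shortcut branch: concatenate along channels
        return self.fusion(y)
```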
2.4. Attention Module
2.4.1. Efficient Channel Attention
The ECA mechanism is a lightweight, computationally efficient channel attention module designed to enhance the representational power of CNNs. Unlike traditional methods, ECA uses a 1D convolution to capture local cross-channel dependencies without dimensionality reduction, enabling the model to adaptively focus on important channels. This approach improves performance with minimal computational cost, making it well suited to real-time applications and resource-constrained environments. While ECA excels at enhancing channel-wise features, its ability to capture global dependencies may be limited, reducing its effectiveness in tasks requiring extensive global context. Overall, ECA strikes a balance between efficiency and performance, offering a robust solution for various visual tasks.
As shown in Figure 4, the ECA mechanism replaces the fully connected layers of the original SENet module with a 1D convolutional layer that learns directly from the features obtained after global average pooling. An adaptive method determines the convolution kernel size from the channel dimension, and the 1D convolution is then applied. Weights are obtained through a sigmoid function and multiplied back onto the original feature map to produce the final output. The kernel size of this 1D convolution determines the extent of cross-channel interaction and the number of channels involved in each weight calculation, making it a critical factor. The ECA mechanism efficiently captures channel-wise relationships in convolutional feature maps, enhancing feature extraction and representation learning.
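The following PyTorch sketch illustrates this mechanism; the adaptive kernel-size heuristic follows the formulation commonly used in ECA-Net implementations, and the hyperparameters are illustrative.

```python
import math
import torch
import torch.nn as nn

class ECA(nn.Module):
    """Sketch of Efficient Channel Attention with an adaptively sized 1D convolution."""
    def __init__(self, channels, gamma=2, b=1):
        super().__init__()
        # Adaptive kernel size: grows with the log of the channel dimension, forced to be odd.
        t = int(abs((math.log2(channels) + b) / gamma))
        k = t if t % 2 else t + 1
        self.pool = nn.AdaptiveAvgPool2d(1)                              # global average pooling
        self.conv = nn.Conv1d(1, 1, kernel_size=k, padding=k // 2, bias=False)
        self.sigmoid = nn.Sigmoid()

    def forward(self, x):
        # x: (B, C, H, W)
        y = self.pool(x)                                  # (B, C, 1, 1)
        y = y.squeeze(-1).transpose(-1, -2)               # (B, 1, C): channels form the 1D sequence
        y = self.conv(y)                                  # local cross-channel interaction
        y = self.sigmoid(y).transpose(-1, -2).unsqueeze(-1)  # (B, C, 1, 1) channel weights
        return x * y                                      # re-weight the original feature map
```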
2.4.2. Coordinate Attention
The CA mechanism is designed to capture relationships between different positions in an image, considering that each position contains unique and important coordinate information. It captures attention across the image’s width and height, encoding precise positional information. CA offers advantages such as efficient spatial feature integration, low computational cost, and improved accuracy in tasks like object detection and segmentation. However, its performance may degrade when global spatial context is required across large feature maps, and it may not capture long-range dependencies as effectively as self-attention mechanisms. Despite these limitations, CA remains a powerful tool for enhancing spatial feature representation in convolutional networks.
As shown in Figure 5, the CA mechanism performs global average pooling separately along the width and height of the input feature map, then concatenates the two pooled descriptors and reduces their dimensionality with a shared convolutional layer. After batch normalization and a nonlinear activation, the features are split back into the two directions, and a 1 × 1 convolution followed by a sigmoid function generates two attention weight matrices, which are multiplied with the original feature map to produce the output. The CA mechanism introduces minimal computational overhead, preserving the model’s lightweight nature. By incorporating positional information, CA enhances spatial selectivity, captures long-range dependencies along each spatial direction, and improves accuracy and generalization, thus making a significant contribution to feature extraction and representation learning in image analysis.
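A minimal PyTorch sketch of this mechanism is given below; the reduction ratio and activation function are illustrative choices rather than the exact settings used in this study.

```python
import torch
import torch.nn as nn

class CoordinateAttention(nn.Module):
    """Sketch of Coordinate Attention: separate pooling along height and width."""
    def __init__(self, channels, reduction=32):
        super().__init__()
        mid = max(8, channels // reduction)
        self.pool_h = nn.AdaptiveAvgPool2d((None, 1))   # pool over width  -> (B, C, H, 1)
        self.pool_w = nn.AdaptiveAvgPool2d((1, None))   # pool over height -> (B, C, 1, W)
        self.conv1 = nn.Conv2d(channels, mid, 1)        # shared 1x1 conv reduces channels
        self.bn = nn.BatchNorm2d(mid)
        self.act = nn.Hardswish()
        self.conv_h = nn.Conv2d(mid, channels, 1)
        self.conv_w = nn.Conv2d(mid, channels, 1)

    def forward(self, x):
        b, c, h, w = x.shape
        xh = self.pool_h(x)                              # (B, C, H, 1)
        xw = self.pool_w(x).permute(0, 1, 3, 2)          # (B, C, W, 1)
        y = torch.cat([xh, xw], dim=2)                   # concatenate along the spatial axis
        y = self.act(self.bn(self.conv1(y)))
        yh, yw = torch.split(y, [h, w], dim=2)           # split back into the two directions
        yw = yw.permute(0, 1, 3, 2)                      # (B, mid, 1, W)
        ah = torch.sigmoid(self.conv_h(yh))              # attention weights along height
        aw = torch.sigmoid(self.conv_w(yw))              # attention weights along width
        return x * ah * aw                               # apply both weight maps to the input
```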
2.5. Transfer Learning of Rice Disease Identification Using MobileViT-DAP
Transfer learning is a machine learning technique that transfers existing knowledge from one domain to another to improve learning performance in the target domain [28]. In image identification, transfer learning can leverage vast amounts of data and pre-trained models to learn new image identification tasks without the need to train complex neural networks from scratch. This approach saves time and computational resources while improving the model’s generalization capabilities and accuracy. In this paper, transfer learning is employed for model training.
Single attention mechanisms may not fully capture the complexities of classification tasks, especially those involving intricate patterns or fine-grained feature interactions. In this context, CA excels at spatial encoding by effectively capturing positional information but struggles to model inter-channel dependencies. On the other hand, ECA focuses on efficient cross-channel refinement, optimizing channel-wise attention, but lacks strong spatial localization capabilities. To address these limitations, we enhance MobileViT by incorporating a dual-attention mechanism that combines the CA and ECA mechanisms. This modification aims to improve the model’s ability to capture both fine-grained spatial features and inter-channel dependencies. MobileViT-XXS, which uses the smallest set of pre-trained weights in the MobileViT family, serves as the foundation for this enhancement, resulting in a more efficient and effective model suitable for complex classification tasks.
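As an illustration of this transfer learning setup, the sketch below loads ImageNet pre-trained MobileViT-XXS weights via the timm library and adapts the classifier head to the six rice disease classes. The weight source and parameter-name prefixes are assumptions, since the paper does not specify them, and the architectural modifications (MbECA, CA, PoolFormer) would be attached separately.

```python
import timm

# Load ImageNet pre-trained MobileViT-XXS weights as the starting point (transfer learning)
# and replace the classifier head for the six rice disease classes.
# Note: the model name and availability depend on the installed timm version.
model = timm.create_model("mobilevit_xxs", pretrained=True, num_classes=6)

# Optionally freeze the earliest layers and fine-tune only the later stages and the head.
# The "stem" prefix is an assumption about timm's parameter naming.
for name, param in model.named_parameters():
    if name.startswith("stem"):
        param.requires_grad = False
```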
Figure 6c presents the overall architecture of MobileViT-DAP. The main modifications include replacing the MV2 module with the MbECA module, substituting the last two MobileViT blocks with the PoolFormer module, and adding a CA module after the final 1 × 1 convolutional layer. As shown in Figure 6a, the MbECA module is implemented by adding an ECA module to the MV2 module [29]. Specifically, its input feature map first passes through a 1 × 1 convolutional layer, which increases the number of channels, followed by an optional downsampling depthwise separable convolution operation consisting of a 3 × 3 channel-wise convolutional layer and a 1 × 1 point-wise convolutional layer. Subsequently, the number of channels is reduced through another 1 × 1 convolutional layer, and the ECA module is incorporated. Finally, a residual connection is applied. As illustrated in Figure 6b, the PoolFormer architecture is simple yet achieves highly competitive performance, demonstrating that structural simplification can be implemented without compromising model performance [30]. The added CA module at the end is designed to capture additional spatial positional feature information. Overall, the model enhances spatial feature perception through CA, addressing the limitations of spatial information in MobileViT. It strengthens channel feature relationships through ECA, improving the ability to capture finer details, and further enhances the model’s lightweight and efficient characteristics with PoolFormer.
2.6. Evaluation Standard
When evaluating model performance, we selected precision, recall, specificity, and accuracy as evaluation metrics. Precision represents the proportion of correct positive predictions out of all positive predictions. Recall represents the proportion of correct positive predictions out of all actual positive samples. Specificity represents the probability of correctly predicting negative samples out of all negative samples, assessing the model’s ability to identify negative class samples. Accuracy represents the proportion of correct predictions out of all predicted samples.
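For clarity, the four metrics can be written in their standard forms:

```latex
\[
\text{Precision} = \frac{TP}{TP + FP}, \qquad
\text{Recall} = \frac{TP}{TP + FN}, \qquad
\text{Specificity} = \frac{TN}{TN + FP}, \qquad
\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}.
\]
```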
In these formulas, TP (true positives) refers to the number of actual positive samples predicted as positive by the model, TN (true negatives) refers to the number of actual negative samples predicted as negative, FP (false positives) refers to the number of samples predicted as positive that are actually negative, and FN (false negatives) refers to the number of samples predicted as negative that are actually positive.
2.7. Interpretability Methods
Visualization techniques, such as Grad-CAM, SHAP, and t-SNE, are crucial for improving the interpretability of machine learning models by offering insights into how models make decisions. These techniques map the model’s internal representations and decision-making process to visual cues, helping identify the features or regions in the input data that contribute most to predictions.
Grad-CAM is a powerful technique for visualizing which regions of an image contribute most to a model’s decision by generating class-specific heatmaps. It calculates the gradients of the target class with respect to the final convolutional layer and uses these gradients to weight the feature maps, highlighting discriminative areas in the input image [31]. Grad-CAM is non-invasive and model-agnostic, making it applicable to a wide range of deep learning models, including those used for image classification. It enhances the interpretability of complex models by providing spatial attention maps, facilitating a better understanding of the decision-making process. However, its reliance on the final convolutional layer may lead to imprecise localization, and the generated heatmaps can be noisy, particularly in fine-grained tasks.
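A minimal sketch of the Grad-CAM computation is shown below; it is not the exact visualization pipeline used in this study, and `target_layer` would typically be the final convolutional layer of the network.

```python
import torch
import torch.nn.functional as F

def grad_cam(model, image, target_layer, class_idx=None):
    """Minimal Grad-CAM: weight the target layer's feature maps by their pooled gradients."""
    feats, grads = {}, {}
    h1 = target_layer.register_forward_hook(lambda m, i, o: feats.update(a=o))
    h2 = target_layer.register_full_backward_hook(lambda m, gi, go: grads.update(a=go[0]))

    logits = model(image.unsqueeze(0))               # image: (C, H, W) tensor
    if class_idx is None:
        class_idx = logits.argmax(dim=1).item()
    model.zero_grad()
    logits[0, class_idx].backward()                  # gradients of the target class score
    h1.remove()
    h2.remove()

    weights = grads["a"].mean(dim=(2, 3), keepdim=True)       # global-average-pooled gradients
    cam = F.relu((weights * feats["a"]).sum(dim=1))           # weighted sum of feature maps
    cam = F.interpolate(cam.unsqueeze(1), size=image.shape[1:],
                        mode="bilinear", align_corners=False)
    cam = (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)  # normalize heatmap to [0, 1]
    return cam.squeeze(), class_idx
```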
SHAP is a game theory-based method that provides model-agnostic explanations by assigning each feature a contribution value based on its impact on the model’s prediction. It ensures fairness and consistency by calculating Shapley values, which sum to the model’s output, and is particularly useful for local interpretability, providing insights into individual predictions [32]. SHAP’s advantages include its ability to explain complex black-box models and its theoretical foundation, making it highly reliable for feature importance analysis. However, its computational complexity can be a limitation, particularly for models with many features, and its effectiveness depends on the quality of the data and the model.
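The snippet below sketches how SHAP attributions could be computed for a trained image classifier with SHAP’s GradientExplainer. The variable names (model, background_images, test_images), tensor shapes, and axis handling are illustrative assumptions and may need adjustment depending on the SHAP version.

```python
import numpy as np
import shap

# Assumed inputs: a trained `model` and two tensors of shape (N, 3, 256, 256),
# `background_images` drawn from the training set and `test_images` to be explained.
explainer = shap.GradientExplainer(model, background_images)   # background approximates the data distribution
shap_values = explainer.shap_values(test_images)                # one attribution map per class

# Rearrange to channel-last before plotting (axis order may differ between SHAP versions).
shap_numpy = [np.transpose(sv, (0, 2, 3, 1)) for sv in shap_values]
test_numpy = np.transpose(test_images.cpu().numpy(), (0, 2, 3, 1))
shap.image_plot(shap_numpy, test_numpy)
```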
t-SNE is a nonlinear dimensionality reduction technique used to visualize high-dimensional data in lower dimensions. It minimizes the divergence between probability distributions representing pairwise similarities in the high-dimensional and low-dimensional spaces, preserving local structure. t-SNE is particularly effective in revealing clusters and complex relationships, making it valuable for exploratory data analysis and model evaluation. However, it has limitations, including high computational complexity, sensitivity to hyperparameters, and poor preservation of global structure.
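A typical workflow, sketched below, extracts penultimate-layer embeddings and projects them to two dimensions with scikit-learn’s TSNE; the feature extractor, data loader, and perplexity value are illustrative assumptions rather than the settings used in this study.

```python
import matplotlib.pyplot as plt
import torch
from sklearn.manifold import TSNE

# Assumed inputs: `feature_extractor` returns the penultimate-layer embedding for a batch
# of images, and `loader` iterates over the test set as (images, targets) pairs.
features, labels = [], []
with torch.no_grad():
    for images, targets in loader:
        features.append(feature_extractor(images).flatten(1).cpu())
        labels.append(targets)
features = torch.cat(features).numpy()
labels = torch.cat(labels).numpy()

# Project the high-dimensional embeddings to 2D and color points by disease class.
embedded = TSNE(n_components=2, perplexity=30, init="pca",
                random_state=42).fit_transform(features)
plt.scatter(embedded[:, 0], embedded[:, 1], c=labels, cmap="tab10", s=5)
plt.title("t-SNE of learned feature embeddings")
plt.show()
```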
Each technique has its limitations. No single method can fully explain a model’s decision-making process. Grad-CAM may lack precision in deeper layers, SHAP can be computationally expensive for large datasets, and t-SNE may fail to preserve global structures effectively. These limitations underscore the need to use a combination of these techniques, allowing for a more comprehensive and complementary understanding of model behavior.
4. Discussion
Accurate identification of rice diseases is essential for effective disease management and yield protection. However, rice disease recognition in real-world field conditions presents significant challenges due to its inherent complexity. This complexity arises from several factors. First, inter-class symptom similarity makes it difficult to distinguish between different diseases, as many pathogens cause visually overlapping lesions or discolorations. Second, within the same disease category, intra-class variability exists due to the progression of symptoms across different growth stages, leading to variations in lesion size, color, and texture. Additionally, the diversity of infection sites on rice plants—including leaves, stems, and panicles—further complicates accurate classification. Moreover, the complexity of field environments exacerbates these challenges, as factors such as varying background conditions, fluctuating light intensity, and occlusions caused by overlapping leaves or other plants introduce noise and inconsistencies in image data [36,37]. These issues collectively highlight the need for robust, adaptable models capable of handling the diverse visual characteristics and environmental variability inherent in rice disease detection tasks.
Compared to MobileViT-XXS, MobileViT-DAP integrates CA, ECA, and PoolFormer modules (Figure 6). These enhancements enable efficient deployment on resource-constrained edge devices, providing a high-performance real-time solution for rice disease identification. Compared to studies [38,39,40], which introduced a single attention module to enhance model performance, this study incorporates a dual-attention mechanism, resulting in greater improvements in classification performance. While CA enhances spatial feature encoding by capturing positional dependencies, it has limited capability in modeling inter-channel relationships. Conversely, ECA efficiently refines channel-wise dependencies but lacks spatial localization ability. The combination of both mechanisms enables the model to capture fine-grained spatial details and robust channel interactions simultaneously, improving feature representation and classification performance. Additionally, the integration of the PoolFormer module further reduces the model’s complexity without compromising performance. Model complexity and computational efficiency were also incorporated as key evaluation metrics to demonstrate the advantages of the proposed method.
The MobileViT-DAP model proposed in this paper reduces the Params by 21% compared to the original MobileViT-XXS while achieving improvements in classification performance. As shown in Table 2 and Figure 8, compared to traditional lightweight models, such as EfficientNet and MobileNet, MobileViT-DAP demonstrates a unique trade-off between model size, computational efficiency, and classification performance. For example, MobileNetV3-Small achieves lower FLOPs than MobileViT-DAP (0.06 G vs. 0.23 G) but falls short in both model size and classification performance. Other lightweight models achieve higher classification performance at the cost of increased model size and computational complexity, yet they still lag behind MobileViT-DAP’s performance gains. Moreover, MobileViT-DAP exhibits appropriate inference times (5.15 ms) and a low memory footprint (3.03 MB), underscoring its suitability for real-world deployments in resource-constrained agricultural environments.
Figure 13 provides a more intuitive comparison of the classification performance of different models on the rice disease image dataset constructed in this study. The results indicate that MobileViT-DAP achieves the highest accuracy, precision, recall, and specificity, with values approaching 1.0, demonstrating its excellent classification performance and robustness. Furthermore, to mitigate overfitting risks associated with a more intricate design, we have implemented L2 regularization strategies and performed extensive cross-validation. These measures ensure that the performance gains are achieved without compromising the model’s generalization capabilities. Overall, despite adopting a lightweight architecture, MobileViT-DAP maintains outstanding performance, highlighting its efficiency and suitability for real-world applications.
Deep learning models often face interpretability challenges due to their black-box nature. In many earlier studies, the interpretability of models was not extensively explored due to the limitations of visualization techniques available at that time [41,42]. Consequently, it was challenging to ascertain whether the models’ decisions were based solely on the intended target features or were inadvertently influenced by extraneous factors such as background information or image parameters. This lack of transparency can undermine trust in the model’s predictions. For instance, as shown in the second column of Figure 9, although MobileViT achieves correct classification, the regions it focuses on have low overlap with the actual lesion areas, raising concerns about the model’s reliability. To address this, we employed Grad-CAM and SHAP for model explanation. Grad-CAM highlights key image regions influencing predictions but is limited by its reliance on convolutional features, resulting in low-resolution heatmaps and reduced effectiveness in models with non-convolutional architectures [43]. In contrast, SHAP provides granular feature-level attributions, yet its computational cost is high for complex models, and it assumes feature independence, which may lead to biased interpretations in datasets with correlated features [18]. It is important to note that while the combined use of Grad-CAM and SHAP significantly enhances the interpretability of our model, the increased architectural complexity of MobileViT-DAP introduces inherent challenges. These challenges may limit the overall transparency of the decision-making process compared to simpler models, representing a trade-off between achieving high classification performance and maintaining model interpretability. In this study, we combined Grad-CAM’s spatial visualization with SHAP’s feature importance analysis for a comprehensive understanding of model decisions. Additionally, t-SNE was applied to visualize high-dimensional feature distributions, offering further insights into the model’s learned representations.
Although MobileViT-DAP demonstrated excellent performance in the experiments, it still has certain limitations. First, the experiments focused on data within a limited scope, and the model’s generalizability requires further validation and optimization across more diverse application scenarios. While this study constructed a large, high-quality rice disease image dataset, establishing similarly scaled datasets for other applications may not be feasible. Therefore, applying the model to small-sample classification tasks remains an area for further exploration. Additionally, the efficient attention mechanism enables the model to focus on lesion regions to improve its reliability. However, for samples with similar lesion features, it may increase the likelihood of misclassification. As shown in Figure 12, late-stage rice bacterial leaf streak was misclassified as early-stage rice bacterial leaf blight, highlighting the challenge of distinguishing between these similar symptoms. Moreover, the model’s performance in handling unseen diseases or environmental variations has not been thoroughly evaluated, which could affect its robustness in real-world conditions. Another relevant factor that may contribute to misclassification is nutrient deficiency, particularly potassium deficiency, which can cause leaf discoloration and necrotic margins that closely resemble bacterial leaf blight and may lead to false positives in classification. Future research should focus on enhancing the model’s ability to capture fine-grained features and extending its attention mechanism to incorporate contextual information, thereby improving its discriminatory power and reducing the likelihood of misclassification in complex classification tasks.
5. Conclusions
In this study, we constructed a large, high-quality rice disease image dataset to support the development and evaluation of deep learning models for plant disease classification. Based on the MobileViT architecture, we proposed an improved model named MobileViT-DAP by integrating CA, ECA, and PoolFormer blocks. The experimental results demonstrated that, compared to various lightweight models, the improved model achieved significant enhancements in terms of model complexity, computational efficiency, and classification performance. Specifically, compared to the baseline model, MobileViT-DAP achieved an impressive reduction of 21% in Params, while maintaining high performance with an accuracy of 99.61%, a precision of 99.64%, a recall of 99.59%, and a specificity of 99.92%, effectively balancing lightweight design and high accuracy. Furthermore, the visualization analysis revealed that the model’s attention regions during the decision-making process highly overlapped with the actual lesion areas, indicating that the model primarily relies on disease-specific features for classification. This not only enhances the trustworthiness of the model’s predictions but also further validates its superior performance. Overall, this work offers a novel perspective for optimizing plant disease recognition tasks. In future work, we aim to expand the model’s application scenarios and further optimize its performance across diverse agricultural environments.