1. Introduction
Wheat is the world’s most important cereal crop, being directly intertwined with humanity’s survival and advancement [1]. Common wheat diseases include wheat rust [2,3,4], wheat powdery mildew [5], wheat smut [6], and wheat scab [7]. These diseases significantly reduce the quality and yield of wheat, resulting in substantial economic losses. Therefore, the rapid and accurate detection and identification of wheat diseases are vital measures to ensure healthy wheat growth and safeguard agricultural security [8].
Deep learning (DL) approaches have been frequently employed in identification tasks in the agricultural field and have produced remarkable outcomes, allowing for high identification accuracy at a relatively low cost [9]. However, one of the biggest challenges when using DL for agricultural recognition tasks is maintaining image recognition accuracy under real-world conditions. Images from agricultural production settings are often affected by complex backgrounds, adverse weather, focus blurring, occlusion, and the presence of irrelevant objects [10]. Additionally, image quality can be significantly impacted by the size, position, and shape of the imaged target, as well as by the lighting and shooting conditions. These variables are the primary causes of classification errors. Therefore, improving image recognition capability in environments with complex backgrounds is a significant issue.
At present, attention mechanisms are widely used to improve model performance [11]. An attention mechanism is a deep learning technique that simulates the selective attention and weighted processing of input data performed by the human visual system [12]. In a traditional neural network, every input is treated equally; in practice, however, different components or regions of the input require different levels of attention. This notion is particularly crucial in agricultural classification tasks with complex backgrounds: when identifying a specific crop disease, the model should focus on the regions that exhibit distinct disease symptoms. An attention mechanism can allocate attention to these relevant areas efficiently through autonomous learning and weight adjustment. In addition, attention mechanisms are easy to interpret through Grad-CAM, which visualizes the key features or areas emphasized by the model and thereby facilitates the decision-making process [13]. For large agricultural targets, deep learning models now outperform other classification methods; however, their performance may fall short of expectations in the initial stages of symptom appearance and when diseases present similar symptoms. Introducing an attention mechanism can partially compensate for this disadvantage.
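The interpretability offered by Grad-CAM rests on a simple computation: each convolutional feature map is weighted by the spatially averaged gradient of the class score with respect to that map, and only positive contributions are kept. A minimal numpy sketch, assuming the feature maps and their gradients have already been extracted from a trained network (e.g., via framework hooks):

```python
import numpy as np

def grad_cam(feature_maps, gradients):
    """Grad-CAM heatmap from convolutional feature maps and the
    gradients of the class score w.r.t. those maps.

    feature_maps, gradients: arrays of shape (C, H, W).
    Returns an (H, W) heatmap normalized to [0, 1].
    """
    # alpha_k: global-average-pool the gradients per channel
    alphas = gradients.mean(axis=(1, 2))  # shape (C,)
    # channel-weighted sum of feature maps, then ReLU
    cam = np.maximum((alphas[:, None, None] * feature_maps).sum(axis=0), 0.0)
    if cam.max() > 0:
        cam = cam / cam.max()
    return cam

# toy example with random "activations" and "gradients"
rng = np.random.default_rng(0)
A = rng.standard_normal((8, 7, 7))
dA = rng.standard_normal((8, 7, 7))
heatmap = grad_cam(A, dA)
print(heatmap.shape)
```

Overlaying such a heatmap on the input image shows which regions drove the prediction, which is how attention placement is typically compared across modules.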
Attention mechanisms have been introduced into the field of computer vision, imitating the ability of the human visual system to focus on salient regions in complex scenes; they can be categorized into channel attention, spatial attention, temporal attention, branch attention, and so on [14]. Spatial attention and channel attention are the most commonly used in deep learning. Their main difference lies in whether the mechanism focuses on specific spatial regions of the input or on specific channels when processing images or feature maps. The channel attention mechanism enhances the model’s ability to use multi-channel feature information and representations by allowing for weighted processing of, and selective attention to, the channel dimensions of the input. Meanwhile, the spatial attention mechanism applies weighted processing and selective attention to the spatial dimensions of the input, and can effectively exploit the feature information of different locations to improve the perceptual range and accuracy of the model. Common attention mechanisms include the squeeze-and-excitation (SE) module [15], the channel attention (CA) module [16], the efficient channel attention (ECA) module [17], the convolutional block attention module (CBAM) [18], and the simple parameter-free attention module (SimAM) [19]. Plug-and-play attention mechanisms can be easily integrated into pre-existing models, allowing for significant improvements in accuracy with very few additional parameters. At present, integrating spatial attention, channel attention, or both is a major method for improving model performance [20].
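Of these modules, SimAM is notable for being parameter-free: it derives a per-neuron weight from an energy function over each channel’s spatial statistics, rather than from learned layers. A minimal numpy sketch of this weighting, following the published SimAM formulation (variable names are ours, not those of any official implementation):

```python
import numpy as np

def simam(x, lam=1e-4):
    """Parameter-free SimAM attention over a feature map.

    x: array of shape (C, H, W); lam: the lambda regularizer added
    to the variance for numerical stability.
    Returns the reweighted feature map, same shape as x.
    """
    c, h, w = x.shape
    n = h * w - 1
    mu = x.mean(axis=(1, 2), keepdims=True)       # per-channel mean
    d = (x - mu) ** 2                             # squared deviation
    var = d.sum(axis=(1, 2), keepdims=True) / n   # per-channel variance
    # inverse energy: more distinctive neurons get larger gate values
    e_inv = d / (4.0 * (var + lam)) + 0.5
    return x * (1.0 / (1.0 + np.exp(-e_inv)))     # sigmoid gating

x = np.random.default_rng(1).standard_normal((16, 14, 14))
y = simam(x)
print(y.shape)
```

Because no weights are learned, the module adds essentially no parameters and can be dropped into an existing backbone between layers.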
In order to identify crop diseases swiftly and accurately, Genaev et al. [21] proposed a method for the recognition of five fungal diseases of wheat shoots based on EfficientNetB0, using an image hashing approach to reduce the degradation of the training data; the model’s highest accuracy on the data set used was 94.20%. Nigam et al. [22] created a data set called WheatRust21 and used a fine-tuned EfficientNetB4 to achieve 99.35% test accuracy on it. Nigam et al. [23] combined an attention mechanism with the EfficientNetB0 model on the WheatRust21 image data set and obtained a test set accuracy of 98.70%. Cheng et al. [24] proposed a lightweight crop disease image recognition model, DSGIResNet_AFF, based on attention feature fusion; this model was superior to other network models, with fewer parameters and floating point operations than the original model and an accuracy of 98.30%, making it suitable for mobile devices. Zhao et al. [25] proposed DTL-SE-ResNet50, which integrates the SE module into ResNet50 with dual transfer learning to recognize vegetable diseases under both simple and complex backgrounds; it performed better than traditional models and identified vegetable diseases quickly, with a shorter detection time and higher accuracy than DTL-CAM-ResNet50 and DTL-SA-ResNet50. A network that deeply integrated the SE module into ShuffleNetV2 was constructed by Xu et al. [26]; its accuracy was 4.85% higher than that of the original model. Yang et al. [27] established a model named DGLNet to address background noise and the dispersed distribution of disease symptoms in real environments; combining a Global Attention Module (GAM) and a Dynamic Representation Module (DRM), DGLNet reached recognition accuracies of 99.82% and 99.71% on two plant disease data sets, respectively, outperforming state-of-the-art methods. Chen et al. [28] proposed a novel domain-adaptive image recognition method called the simple domain adaptation network (SDAN), which combines channel and location attention modules for rice disease recognition with a small number of samples.
The above studies demonstrated that the use of an attention mechanism can improve the accuracy of plant disease recognition models. In this study, a lightweight convolutional neural network for wheat disease recognition based on near-ground remote sensing data, named MnasNet-SimAM, is proposed to address the persistent difficulty of recognizing crop diseases in real, complex environments. The SimAM module is used to extract depth features, focus on the disease locations, and avoid redundant information. In addition, the training speed and recognition ability of the network are improved through improved activation functions and normalization. The main contributions of this research are outlined below:
The effectiveness of five lightweight convolutional neural networks to identify six common wheat diseases and healthy wheat is explored, based on two optimizers and three learning rate scheduling strategies.
The influence of different values of λ in the SimAM module on model recognition accuracy is studied, and the performance of the improved model is verified through visualization of the model results. Grad-CAM is used to compare the effects of different attention mechanisms in MnasNet.
The influence of agricultural pre-training of weights on the model’s dual transfer learning is analyzed.
The generalization ability of MnasNet-SimAM on public data sets is validated.
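As background for the learning-rate comparison mentioned above: the specific scheduling strategies are not restated here, but two widely used ones can be sketched in a few lines. These are shown purely as illustrations of the kind of strategies compared in such experiments, not as the exact schedules used in this study:

```python
import math

def step_decay(base_lr, epoch, drop=0.1, every=30):
    """Multiply the learning rate by `drop` every `every` epochs."""
    return base_lr * (drop ** (epoch // every))

def cosine_annealing(base_lr, epoch, total_epochs, min_lr=0.0):
    """Smoothly anneal from base_lr down to min_lr over total_epochs."""
    t = epoch / total_epochs
    return min_lr + 0.5 * (base_lr - min_lr) * (1 + math.cos(math.pi * t))

# step decay changes abruptly; cosine annealing decays smoothly
print(step_decay(0.1, 35))                 # after the first drop
print(cosine_annealing(0.1, 50, 100))      # halfway through training
```

The choice of schedule interacts with the optimizer, which is why such comparisons are usually run as a grid over both.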
4. Discussion
Many studies have shown that the use of attention mechanisms can significantly improve model performance. As this study built a wheat disease recognition network based on transfer learning, the added attention mechanism should not change the model’s network structure. Therefore, the SimAM module was added to the last three layers of the inverted residual network of MnasNet. The resulting model could capture more global context information and better understand the input image; it could better capture the non-linear relationships between pixels, better extract the complex features of the image, and obtain useful information while suppressing useless information [38,39,40]. The size of the original model increased by only 0.01 MB after adding the SimAM module, meaning that it remained efficient for training agricultural disease image classification models. Li [41] introduced a convolutional neural network model called Sim-ConvNeXt for maize disease classification; the SimAM attention module was integrated into this model, and the accuracy was improved by 1.5% over the original model, consistent with the results of this study.
λ is an important parameter used by the SimAM module to calculate the importance of neurons. It is a regularization term added to the denominator when calculating the variance, ensuring numerical stability and avoiding division by zero. Yang [19] explored the influence of the λ value on SimAM performance: the highest accuracy was achieved with a value of 10⁻⁵, while performance declined at 10⁻⁶. In this study, however, the average accuracy on the test set was similar for λ values of 10⁻⁵ and 10⁻⁶; different from the results obtained by Yang, the maximum accuracy of the module, 95.14%, was reached when λ was equal to 10⁻⁶. The strength of the attention mechanism and the performance of the model may be affected differently by the λ value owing to differences in the nature of the task. In this study, the attention mechanism almost failed when λ was close to zero: the model might ignore most of the information in the input data, losing its focus on important features and degrading performance. When λ was too large, the SimAM module focused too much on some local features and ignored others, causing the model to be overly sensitive and/or to overfit to noise and irrelevant information. The optimal value of λ can be expected to vary according to the global or local attention required by the actual task; therefore, determining it may require multiple tests and hyperparameter adjustments. When training models for agricultural disease classification tasks, λ needs to be reduced so that the model can attend to small disease spots while ignoring the influence of the complex background.
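The role of λ in the energy function can be illustrated numerically. In the toy sketch below (our own illustration, applying the published SimAM formula to a single channel), a small λ produces a large gap between the gate value of a distinctive "lesion" pixel and the background, while a very large λ flattens the gate toward uniformity:

```python
import numpy as np

def simam_weights(x, lam):
    """SimAM sigmoid gate for a single-channel map x of shape (H, W)."""
    n = x.size - 1
    d = (x - x.mean()) ** 2
    var = d.sum() / n
    e_inv = d / (4.0 * (var + lam)) + 0.5
    return 1.0 / (1.0 + np.exp(-e_inv))

# a flat "background" with one distinctive "lesion" pixel
x = np.zeros((8, 8))
x[3, 3] = 5.0

for lam in (1e-6, 1e-4, 1e2):
    w = simam_weights(x, lam)
    # gap between the lesion pixel's gate and a background pixel's gate
    print(lam, round(w[3, 3] - w[0, 0], 4))
```

In this toy case the lesion-versus-background gap shrinks monotonically as λ grows, which is consistent with tuning λ downward when small disease spots must stand out against a complex background.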
Dual transfer learning is an advanced transfer learning technique that allows for better adaptation to the target task by using knowledge from multiple source tasks simultaneously. Zhao [25] investigated the impact of dual transfer learning on ResNet50, using the ImageNet data set for single transfer learning and the AI Challenger 2018 data set for dual transfer learning; dual transfer learning improved the model’s training efficiency and accuracy. As mentioned in the study of Mukhlif [37], most previous transfer learning studies suffer from overfitting; hence, a 50% dropout layer was added to their experiments to minimize this problem. In this study, we likewise added a 20% dropout layer, but the accuracy on the test set still decreased by about 4%. This might be due to domain differences, feature mismatches, or overfitting, as reported in previous studies. Although the dropout layer sped up training convergence, sacrificing accuracy is not cost-effective when computing power demands are high and samples are few. Therefore, researchers need to continue exploring ways to improve training speed while maintaining accuracy.
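For reference, the dropout layers discussed here typically follow the standard "inverted dropout" formulation, in which surviving activations are rescaled during training so that nothing needs to change at inference time. A minimal numpy sketch (ours, not the implementation used in either study):

```python
import numpy as np

def dropout(x, p=0.2, training=True, rng=None):
    """Inverted dropout: zero each activation with probability p during
    training and rescale survivors by 1/(1-p), keeping the expected
    activation unchanged so inference needs no correction."""
    if not training or p == 0.0:
        return x
    rng = rng or np.random.default_rng()
    mask = rng.random(x.shape) >= p   # True = keep
    return x * mask / (1.0 - p)

x = np.ones(1000)
y = dropout(x, p=0.2, rng=np.random.default_rng(0))
print(round(float(y.mean()), 2))  # close to 1.0 in expectation
```

With p = 0.5 (as in Mukhlif's experiments) half of the activations are dropped per step; with p = 0.2 the regularization is milder, which matches the weaker overfitting control observed here.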
When tested on the WFD data set, the F1 score for wheat leaf rust was below 90% because 4 out of 50 leaf rust images were misclassified as stripe rust. In the study of Jiang [42], the seven models tested also confused stripe rust and leaf rust in the wheat disease recognition task; the most serious misjudgment (8%) was observed with DenseNet-121. As the similarity between the disease spots of wheat stripe rust and leaf rust can lead to classification errors, how to distinguish similar diseases should be considered in future work.
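The reported F1 behavior is easy to reproduce arithmetically. Recall follows directly from the stated confusion counts (46 of 50 leaf rust images classified correctly); precision depends on how many other-class images were predicted as leaf rust, which is not stated, so the precision value below is an assumed illustration only:

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Recall for wheat leaf rust on the WFD data set: 46 of 50 correct
# (4 images confused with stripe rust).
recall = 46 / 50  # 0.92
# Precision is not reported; 0.85 is a hypothetical value chosen to
# illustrate how a modest precision drop pulls F1 below 90%.
precision = 0.85
print(round(f1_score(precision, recall), 3))
```

Because F1 is a harmonic mean, either metric falling noticeably below 0.9 is enough to drag the score under 90% even when the other remains high.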