1. Introduction
Deep learning (DL) is a subfield of machine learning that can extract high-level features from raw data with hierarchical convolutional neural network architectures [1]. With considerable advances in algorithms and the dramatically lowered cost of computing hardware [2], it has been widely applied to numerous complex applications [3]. In agriculture, DL has become a popular technology in many areas, such as crop classification [4], fruit grading [5], pest detection [6], plant disease recognition [7], and weed detection [8].
Despite these advantages, the drawbacks and barriers of DL cannot be ignored, especially in further applications to modern agriculture. For example, the models must be trained on large, high-quality datasets for a long time to achieve an acceptable level of accuracy and relevance [9]. Among these shortcomings, the lack of “explainability” is a crucial obstacle to the widespread application of black-box models and remains an important open problem for artificial neural networks and deep learning [10]. With such neural network-based black-box models, users cannot fully grasp why a particular output is generated [11,12].
To make models more transparent and interpretable, explainable artificial intelligence (xAI), which is considered to offer the highest level of explainability, accuracy, and performance, has become increasingly important [10]. In recent years, with continuing xAI research, visualization of model internals has become one of the most intuitive ways to explore the interpretable cognitive factors of deep learning. By mapping abstract data into images, a visual representation of the model is established, which makes it easier for researchers to understand the deep learning model and its internal representations, reduces the complexity of the model to a certain extent, and improves transparency.
Many excellent visualization methods have been proposed in recent years [13,14,15]. Simonyan et al. visualized the partial derivatives of the predicted class score as pixel intensities and further refined the raw gradients through backpropagation and deconvolution, improving visualization quality [13]. Zeiler et al. visualized the neurons inside a deep neural network through activation maximization and sampling, finding the input images that maximally activate a given filter and thereby highlighting specific pixel regions [16]. By reversing the network through unpooling, rectification, and deconvolution, the interior of the convolutional network is visualized, low-level and high-level features are revealed, and target recognition ability is improved. These methods all produce fine-grained visualizations, but they cannot discriminate between categories. Zhou et al. proposed a more intuitive interpretability algorithm, class activation mapping (CAM), for localization [17]. CAM replaces the fully connected layer with a convolutional layer and global average pooling (GAP), making full use of spatial information and improving robustness; a softmax layer after GAP produces class-specific feature maps. Building on CAM, Selvaraju et al. introduced GradCAM, which combines feature maps using gradient signals; it overcomes CAM's requirement to modify the model structure and is applicable to any CNN-based model [18].
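As a concrete reference for the GradCAM computation discussed here, the following is a minimal sketch in PyTorch. It assumes a torchvision-style ResNet; the hook placement, class selection, and normalization are illustrative choices, not the authors' exact implementation.

```python
import torch
import torch.nn.functional as F
from torchvision import models

class GradCAM:
    """Minimal GradCAM: weight feature maps by pooled gradients, then ReLU."""
    def __init__(self, model, target_layer):
        self.model = model.eval()
        self.activations, self.gradients = None, None
        target_layer.register_forward_hook(self._save_activation)
        target_layer.register_full_backward_hook(self._save_gradient)

    def _save_activation(self, module, inputs, output):
        self.activations = output.detach()

    def _save_gradient(self, module, grad_input, grad_output):
        self.gradients = grad_output[0].detach()

    def __call__(self, x, class_idx=None):
        scores = self.model(x)                                   # (1, num_classes)
        if class_idx is None:
            class_idx = scores.argmax(dim=1).item()
        self.model.zero_grad()
        scores[0, class_idx].backward()
        weights = self.gradients.mean(dim=(2, 3), keepdim=True)  # GAP over gradients
        cam = F.relu((weights * self.activations).sum(dim=1, keepdim=True))
        cam = F.interpolate(cam, size=x.shape[-2:], mode="bilinear",
                            align_corners=False)
        return (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)

# Usage: heatmap from the last residual stage of an (untrained) ResNet-50
model = models.resnet50(weights=None)
heatmap = GradCAM(model, model.layer4)(torch.randn(1, 3, 224, 224))
```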
As the number of convolutional layers increases, the “black box” (uninterpretability) problem of network frameworks becomes more serious, which makes studies of model interpretability increasingly necessary. Applications in agriculture are complex: scenes are cluttered, plant types are diverse, and environmental and other interfering factors are numerous. It is therefore particularly important to improve the interpretability of agricultural models, yet there are few reports in this area. Ghosal et al. proposed an interpretation mechanism for images of stressed and healthy soybean leaflets in the field, making predictions based on the top-K high-resolution feature maps extracted from local activation levels; however, their work did not explain the internal mechanism of the model [19].
We explored the internal interpretability of deep learning models using a fruit leaf dataset. The specific objectives were to: (1) compare the performance of the ResNet [20], GoogLeNet [21], and VGG [22] network frameworks; (2) introduce the attention mechanism to build the ResNet-Attention model; and (3) compare three interpretability algorithms, SmoothGrad [23], LIME [24], and GradCAM [18]. To study whether the model is more inclined to identify the shape features of the leaves or the texture features of the diseased spots, the dataset was rearranged into three different experiments: Experiment I is a classification experiment on the combination of fruit type and disease or insect pest, a multi-class problem with 34 categories; Experiment II is a classification experiment on whether a leaf is diseased or not, a binary classification problem; Experiment III is based on fruit type, a multi-class problem with 11 categories.
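As an illustration of how one annotated image can be relabeled for the three experiments, the sketch below derives the three targets from a combined class name. The "fruit_condition" naming convention is an assumption made for illustration, not a documented detail of the dataset.

```python
def make_labels(class_name):
    """Derive the three experiments' targets from a combined class name (assumed format)."""
    fruit, _, condition = class_name.partition("_")        # e.g. "apple_scab"
    label_exp1 = class_name                                # Experiment I: 34 fruit+condition classes
    label_exp2 = "healthy" if condition == "healthy" else "diseased"  # Experiment II: binary
    label_exp3 = fruit                                     # Experiment III: 11 fruit classes
    return label_exp1, label_exp2, label_exp3

print(make_labels("apple_scab"))      # ('apple_scab', 'diseased', 'apple')
print(make_labels("grape_healthy"))   # ('grape_healthy', 'healthy', 'grape')
```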
3. Results
Three classification experiments were carried out in this study. Each experiment used three frameworks (VGG, GoogLeNet, and ResNet) and two extended models (ResNet34-CBAM and ResNet50-CBAM) for visual display. Experiment I is a multi-class experiment on the combination of fruit type and disease or insect pest. Experiment II is a binary classification experiment on whether a leaf is diseased or not. Experiment III is a multi-class experiment on fruit type.
3.1. Classification Experiment of Fruit Species and Pests
The dataset is divided into 34 categories according to the combination of fruit type and disease or pest type, with about 1000 images per category. The goal of Experiment I is to explore whether the model can recognize both the shape features of leaves and the texture features of lesions.
The accuracies of the three models on the test set in Experiment I are shown in Table 3. The average accuracies of VGG, GoogLeNet, and ResNet are 98.06%, 98.86%, and 99.11%, respectively. Comparing the individual categories, although the accuracies of the three models differ little overall, the classification accuracy of ResNet on apple scab, grape black rot, and guava whitefly is clearly better than that of VGG and GoogLeNet. Therefore, the ResNet model performs best on this dataset.
The attention-based CBAM module is introduced into ResNet to construct the ResNet34-CBAM and ResNet50-CBAM models.
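For reference, a minimal sketch of such a CBAM block in PyTorch is shown below: channel attention followed by spatial attention, to be inserted after a residual stage. The reduction ratio and the 7×7 spatial kernel follow the original CBAM paper and are not necessarily the settings used in this study.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Shared MLP over average- and max-pooled channel descriptors."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels))

    def forward(self, x):
        b, c, _, _ = x.shape
        avg = self.mlp(x.mean(dim=(2, 3)))     # average-pooled descriptor
        mx = self.mlp(x.amax(dim=(2, 3)))      # max-pooled descriptor
        return torch.sigmoid(avg + mx).view(b, c, 1, 1)

class SpatialAttention(nn.Module):
    """7x7 conv over channel-wise average and max maps."""
    def __init__(self, kernel_size=7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)

    def forward(self, x):
        avg = x.mean(dim=1, keepdim=True)
        mx, _ = x.max(dim=1, keepdim=True)
        return torch.sigmoid(self.conv(torch.cat([avg, mx], dim=1)))

class CBAM(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.ca, self.sa = ChannelAttention(channels), SpatialAttention()

    def forward(self, x):
        x = x * self.ca(x)      # channel attention first
        return x * self.sa(x)   # then spatial attention
```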
Figure 4 shows the results for a diseased leaf in the dataset; the results for other images are similar. In Figure 4, each row represents the visualization results of one model, and each column represents a different visualization method or the original image, where Layer1–4 correspond to the four convolutional layers of the ResBlocks shown in Figure 3.
Comparing the four models in Figure 4, the ResNet-CBAM structures perform significantly better than the plain ResNet models. The ResNet34 model cannot filter out background information well, resulting in mediocre results. The ResNet34-CBAM model overcomes this shortcoming, maintaining high confidence on leaf shape and lesion characteristics. This indicates that introducing the attention module benefits the model's feature extraction and makes the visualizations more explanatory.
Comparing the ResNet34-CBAM and ResNet50-CBAM models, the ResNet34-CBAM model gives significantly better overall results on diseased leaves in this multi-class experiment. From the layer-by-layer results, the model first focuses on the shape of the leaf and ignores the location of the diseased spots, then attends to the diseased-spot features in the later stages, and finally combines the two sets of features to achieve better classification. In contrast, the visualization results of the ResNet50-CBAM model are not as good.
Comparing GradCAM, SmoothGrad, and LIME, GradCAM gives the best results. GradCAM is useful for class discrimination because it clearly shows what each layer of the network attends to. Compared with GradCAM, the concise, layered knowledge produced by LIME is easier to extract: it highlights superpixels, so one can see how the network's decision is explained in terms of patches of similar pixels. The SmoothGrad renderings show that this method assigns high weight to the texture features of the lesions in Experiment I (the lesion highlights are more concentrated), but the appearance characteristics of the leaves are not well represented, which does not match our experimental expectations. Therefore, the GradCAM method works best in Experiment I.
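For comparison, a minimal SmoothGrad sketch is shown below: input gradients are averaged over several noisy copies of the image. The number of samples and the noise level are illustrative values, not the settings used in this study.

```python
import torch

def smooth_grad(model, x, class_idx, n_samples=25, noise_level=0.15):
    """Average input gradients over noisy copies of x; returns a saliency map."""
    model.eval()
    sigma = noise_level * (x.max() - x.min())
    grads = torch.zeros_like(x)
    for _ in range(n_samples):
        noisy = (x + sigma * torch.randn_like(x)).requires_grad_(True)
        score = model(noisy)[0, class_idx]
        score.backward()
        grads += noisy.grad
    # Saliency = mean absolute gradient, max over color channels
    return (grads.abs() / n_samples).max(dim=1, keepdim=True)[0]
```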
Based on the above comparison, the ResNet34-CBAM model recognizes the shape features of the leaves and the texture features of the diseased spots simultaneously, achieving a better classification result.
3.2. Classification Experiment of Fruit Leaf Disease
The dataset of Experiment II is divided according to whether the fruit leaf is diseased or not. This is a binary classification experiment; the purpose is to study whether the model can achieve accurate classification only by identifying the texture features of leaf lesions.
The accuracies of the three models on the test set in Experiment II are shown in Table 4. The average accuracies of VGG, GoogLeNet, and ResNet are 98.03%, 98.45%, and 99.40%, respectively. The ResNet model has the best classification performance on the test set: for both healthy and diseased leaves, its classification accuracy exceeds that of the VGG and GoogLeNet models.
Figure 5 shows the results for a diseased leaf and a healthy leaf in the dataset; the results for other images are similar. In Figure 5, each row represents the visualization results of one model, and each column represents a different visualization method or the original image.
In this experiment, the ResNet50 and ResNet50-CBAM models perform better, and both can focus on the lesion characteristics. Compared with the ResNet50 model, the ResNet50-CBAM model extracts leaf shape features while attending to lesion features, and the features it extracts are more detailed. Comparing the GradCAM maps of each layer, the model first attends to the outline of the leaves against the background and then predicts leaf health by attending to internal details and lesion features. The ResNet50-CBAM model gives higher weight to the lesions (the lesions are darkest in Layer3), whereas the other models do not extract detailed features as well. The ResNet50-CBAM model merges all features in the CBAM2 layer to improve prediction accuracy. Therefore, we have reason to believe that the ResNet50-CBAM model performs best in Experiment II.
Comparing GradCAM, SmoothGrad, and LIME, the GradCAM method again shows the best results. The SmoothGrad method can roughly describe the appearance of healthy leaves, but it cannot accurately locate the diseased-spot features. The explanations produced by the LIME method are harder to interpret, but we can still see that LIME roughly outlines the appearance characteristics of the leaves.
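For reference, the sketch below produces a LIME superpixel explanation of this kind with the public lime_image API; model and preprocess stand in for the trained classifier and its input transform and are assumptions, not the authors' code.

```python
import numpy as np
import torch
from lime import lime_image
from skimage.segmentation import mark_boundaries

def predict_fn(images):
    # LIME passes a batch of HxWxC arrays; preprocess is assumed to map
    # one such array to a normalized CxHxW tensor.
    batch = torch.stack([preprocess(img.astype(np.uint8)) for img in images])
    with torch.no_grad():
        return torch.softmax(model(batch), dim=1).numpy()

explainer = lime_image.LimeImageExplainer()
explanation = explainer.explain_instance(
    leaf_image,              # HxWx3 numpy array of the leaf photo
    predict_fn,
    top_labels=1,
    num_samples=1000)        # number of perturbed samples
label = explanation.top_labels[0]
img, mask = explanation.get_image_and_mask(
    label, positive_only=True, num_features=5, hide_rest=False)
overlay = mark_boundaries(img / 255.0, mask)   # outline the explaining superpixels
```

The highlighted superpixels in the resulting overlay correspond to the image patches that the local surrogate model considers most supportive of the predicted class.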
In Experiment II, the model focuses on the texture features of the lesions and also combines the shape features of the leaves to achieve a better classification result. The GradCAM method helps us understand the model's prediction mechanism.
3.3. Classification Experiment of Fruit Types
The dataset of Experiment III was constructed according to fruit type only, yielding a multi-class dataset. The purpose of Experiment III is to explore whether the model can recognize the shape characteristics of the leaves well and to observe whether the model gives high weight to the lesions.
In Experiment III, the test accuracies of the three models after training are shown in Table 5. The average accuracies of VGG, GoogLeNet, and ResNet are 99.45%, 99.67%, and 99.89%, respectively. Although the average accuracies of the three models differ little, the ResNet model has a clear advantage in the classification of apples and peaches. Therefore, we believe that the ResNet model still performs best in Experiment III.
Figure 6 shows the results of each model in Experiment III. Each row represents the results of one model, and the columns represent the results of the different interpretability methods. Among the plain ResNet models, the ResNet50 model performs better than the ResNet34 model; at Layer3 in particular, the results of the ResNet34 model are more confused and the outline is more blurred. In contrast, the models with the attention module are more effective, and the ResNet-CBAM models can clearly describe the shape details of the leaf and the vein details inside it.
Among the ResNet-CBAM models, the ResNet50-CBAM model performs best. Compared with the ResNet34-CBAM model, it lowers the confidence assigned to the diseased spots inside the leaves and increases the weight of the leaf shape features, which is consistent with the experimental expectations. In contrast, the results of the ResNet34-CBAM model are rather vague, with no distinction between leaf contours and inner veins, and its results on diseased leaves show that it gives higher weight to the diseased-spot locations. Overall, the ResNet50-CBAM model is the best model.
Comparing GradCAM, SmoothGrad, and LIME, the GradCAM method again shows the best results. The SmoothGrad method roughly describes the appearance characteristics of the leaves and does not give higher weight to the lesions. The explanations produced by the LIME method are again harder to interpret, but we can still see that LIME roughly outlines the appearance of the leaves.
Therefore, we confirm that in Experiment III the ResNet50-CBAM model attends to the shape characteristics of the leaves and down-weights the texture characteristics of the lesions.
4. Discussion
We also analyzed the background and shadow regions. As shown in Figure 7, using the optimal weights of the Experiment II models, two images with only subtle background differences and without leaf shadows were selected for visualization; the visualization results for other leaves are similar, and we randomly selected one group of images for display. The figure shows that, with these two unfavorable factors excluded, none of the four models assigns high weight to the background during prediction; that is, the models attend to the task-relevant characteristics of the leaves themselves rather than classifying by the image background. Comparing the visualization results of the various models, the ResNet50-CBAM model has the best visualization results for healthy leaves and presents the best results across the three interpretability algorithms. When predicting healthy leaves, the ResNet50-CBAM model attends only to the venation features inside the leaves and the contour features of the leaves. The LIME and SmoothGrad results also show that the ResNet50-CBAM model pays more attention to the characteristics of the leaf itself than the other models and ignores the background, which helps us understand the operating mechanism of the model. For the classification of diseased leaves, the ResNet50-CBAM model can make good predictions from the leaf disease characteristics alone, while the other three models rely to some extent on leaf texture features to assist prediction. Therefore, we believe that the ResNet50-CBAM model performs best in Experiment II, and its results are also the most consistent with human judgment in this task. Additionally, the prediction results of the ResNet50-CBAM model further confirm that the background and shadows have little effect on the model's predictions.
Based on the above three groups of experiments, the ResNet50-CBAM model performs best. To further verify that the attention module improves the feature extraction ability of the ResNet model across the different experiments, we compared the ResNet50-CBAM and ResNet50 models on the same grape black rot leaf, as shown in Figure 8. The overall result of the ResNet50-CBAM model is better than that of the ResNet50 model: the ResNet50-CBAM model has higher confidence in the leaf shape features, and its focus differs between experiments. The ResNet50 model, however, assigns some weight to background noise, which is clearly not the result we want. Comparing the outputs of the Layer4 and CBAM2 layers across the three experiments, the CBAM module is better at grasping the focus of the image. For example, in Experiment II, the ResNet50-CBAM model integrates the appearance and lesion features of the leaves well and extracts more features to improve its predictive ability, whereas the ResNet model does not integrate the features well and discards part of the feature information. Therefore, adding the CBAM layer to the ResNet framework makes feature extraction more detailed and effectively improves the predictive ability of the model. Comparing GradCAM, SmoothGrad, and LIME, the GradCAM method gives the most intuitive and easiest-to-understand results. The SmoothGrad method shows that the pixels at the lesion locations are the most important, but it does not clearly explain the appearance of the leaves. The results of the LIME method do not meet our experimental expectations: LIME only roughly describes the appearance of the leaves and does not have good explanatory power. Neither method is as clear as GradCAM in its visualization. Therefore, the combination of the ResNet50-CBAM model and the GradCAM method provides a better interpretation.
To verify the generalization ability of the model, we chose an image of a diseased eggplant leaf from the Internet and produced visualization predictions using the Experiment II weights of the ResNet50-CBAM model, as shown in Figure 9. In all three interpretability algorithms, the model makes accurate classifications based on the locations of the leaf lesions. As can be seen in Figure 9b,d, the model assigns higher weight to the lesion locations, and the leaf can be distinguished from a healthy one by the texture features of the lesions. This further confirms that the model has good generalization ability and can therefore be applied to more leaf classification scenarios.
5. Conclusions
We studied the interpretability of different classification models on a fruit disease leaf dataset. We designed three different experiments on the dataset: Experiment I is a classification experiment combining fruit species and pest or disease species, a multi-class problem, focusing on whether the model can simultaneously recognize the texture, shape, and lesion characteristics of the leaves; Experiment II is a disease classification experiment, a binary classification problem, focusing on whether the model can recognize the texture characteristics of the lesions well; Experiment III is a multi-class experiment based on fruit type, focusing on whether the model can recognize the shape characteristics of the leaves well. In each experiment, the VGG, GoogLeNet, and ResNet models were used, and the ResNet-attention model was applied with three interpretability methods. Through the three sets of experiments, we confirmed that the ResNet model achieves the best accuracy on our classification tasks: 99.11%, 99.40%, and 99.89%, respectively. The ResNet-CBAM model, constructed by introducing the attention module, improves the model's ability to extract key features and enhances its generalization power. In addition, comparing the three visualization methods SmoothGrad, LIME, and GradCAM, the GradCAM method is the most suitable for agricultural classification tasks.
Finally, through the above series of experiments, we clarified the internal interpretability of convolution-based neural network models in dealing with common leaf diseases and insect pests and identified what the models focus on during feature extraction in the three sets of experiments. The attention module can effectively improve the feature extraction ability of the model. Combined with the three interpretability methods, the results show that the features the model extracts differ across agricultural classification tasks. This research will help practitioners in agriculture make better use of deep learning methods to deal with classification problems in the field.