Wild Mushroom Classification Based on Improved MobileViT Deep Learning
Abstract
1. Introduction
- The M-ViT network model was constructed and fine-tuned by adding an improved multidimensional attention module in parallel with the MobileViT network. This enables the model to capture more effective interactions between local features and global feature information, making it better suited to classifying wild mushroom datasets. A thorough search of the literature indicates that this is the first study to apply a combined Transformer and CNN model to the classification of wild mushrooms;
- The MV2 module in the original network was combined with an improved attention mechanism to enhance the representation of important channels (a minimal sketch of this pattern is given after this list);
- To address the “black box” problem of deep learning, we performed an interpretability analysis of the model, plotting its confusion matrix and heat maps.
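The snippet below is a minimal, illustrative sketch of the second point above, not the authors' released implementation: an MV2-style inverted bottleneck whose expanded features are reweighted by a simple SE-style channel gate. The class names (`MV2Block`, `ChannelGate`), the expansion ratio, and the exact form of the attention are assumptions made for illustration.

```python
# Illustrative sketch only: an MV2-style inverted bottleneck with a simple
# SE-style channel gate. Names and hyperparameters are assumptions.
import torch
import torch.nn as nn

class ChannelGate(nn.Module):
    """Squeeze-and-excitation style reweighting of channels."""
    def __init__(self, channels: int, reduction: int = 4):
        super().__init__()
        self.fc = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),                      # global spatial pooling
            nn.Conv2d(channels, channels // reduction, 1),
            nn.SiLU(),
            nn.Conv2d(channels // reduction, channels, 1),
            nn.Sigmoid(),                                 # per-channel weights in (0, 1)
        )

    def forward(self, x):
        return x * self.fc(x)                             # emphasize informative channels

class MV2Block(nn.Module):
    """Inverted bottleneck (expand -> depthwise -> project) with channel attention."""
    def __init__(self, in_ch: int, out_ch: int, stride: int = 1, expansion: int = 4):
        super().__init__()
        hidden = in_ch * expansion
        self.use_residual = stride == 1 and in_ch == out_ch
        self.block = nn.Sequential(
            nn.Conv2d(in_ch, hidden, 1, bias=False), nn.BatchNorm2d(hidden), nn.SiLU(),
            nn.Conv2d(hidden, hidden, 3, stride, 1, groups=hidden, bias=False),  # depthwise
            nn.BatchNorm2d(hidden), nn.SiLU(),
            ChannelGate(hidden),                          # attention on the expanded features
            nn.Conv2d(hidden, out_ch, 1, bias=False), nn.BatchNorm2d(out_ch),
        )

    def forward(self, x):
        y = self.block(x)
        return x + y if self.use_residual else y

# Example: a 32-channel feature map through one attention-augmented MV2 block.
# feats = MV2Block(32, 32)(torch.randn(1, 32, 56, 56))   # -> torch.Size([1, 32, 56, 56])
```

Placing the gate on the expanded features lets the channel attention act on the richest representation before it is projected back down to the output width.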
1.1. Convolutional Networks (ConvNets)
1.2. Vision Transformers
2. Materials and Methods
2.1. Datasets
2.1.1. Mushroom
2.1.2. MO106
2.2. Data Enhancement
2.3. M-ViT Model
2.3.1. Inverted Bottleneck
2.3.2. M-ViT Block
2.3.3. gMLP Block
2.3.4. Multidimension Attention
- Block Attention: The input feature map $X \in \mathbb{R}^{H \times W \times C}$ is reshaped into a tensor of shape $\left(\frac{H}{P} \times \frac{W}{P},\, P \times P,\, C\right)$, representing its division into non-overlapping windows of size $P \times P$, and the RelAttention computation is finally performed within each window, where $\mathrm{RelAttention}(Q, K, V) = \mathrm{softmax}\!\left(QK^{\top}/\sqrt{d} + B\right)V$ denotes self-attention with a learned relative position bias $B$. First, we define the $\mathrm{Block}(\cdot)$ operation, which converts the input feature map into non-overlapping windows of size $P \times P$, and then we define $\mathrm{Unblock}(\cdot)$ to perform the inverse operation. The Block-SA block in Figure 6 divides the feature map into windows and then performs the RelAttention operation, as in the following equation: $X \leftarrow X + \mathrm{Unblock}\big(\mathrm{RelAttention}(\mathrm{Block}(X, P))\big)$.
- Grid Attention: Grid attention uses $\mathrm{Grid}(\cdot)$ to convert the input features into a uniform $G \times G$ grid, reshaping the input tensor into the shape $\left(G \times G,\, \frac{H}{G} \times \frac{W}{G},\, C\right)$, at which point windows of adaptive size $\frac{H}{G} \times \frac{W}{G}$ are obtained, and the RelAttention calculation is finally performed on the $G \times G$ grid axis. Unlike the Block operation, an additional transpose is needed to place the grid dimension on the assumed spatial axis. Additionally, $\mathrm{Ungrid}(\cdot)$ is defined to perform the inverse operation and return the tokens to 2D space, as in the following equation: $X \leftarrow X + \mathrm{Ungrid}\big(\mathrm{RelAttention}(\mathrm{Grid}(X, G))\big)$ (a code sketch of the Block/Grid partitions follows this list).
- Global Attention: For the input feature map $X \in \mathbb{R}^{H \times W \times C}$, we first operate on the input with an $n \times n$ depthwise (DW) convolution and a $1 \times 1$ convolution to obtain $X_L \in \mathbb{R}^{H \times W \times d}$. The DW convolution learns local and per-channel spatial information, preventing the loss of the channels' spatial information, while the $1 \times 1$ convolution projects the input features into a higher-dimensional space. A vision transformer (ViT) with multi-head self-attention is used for modeling to obtain longer-distance relations. However, ViT has many parameters and weak optimization capability because it lacks inductive bias. To enable the Transformer to learn a global representation with spatial inductive bias, $X_L$ is first unfolded into $N$ non-overlapping flattened patches $X_U \in \mathbb{R}^{P \times N \times d}$, where $P = wh$, $N = \frac{HW}{P}$ is the number of patches, and $h$ and $w$ are the height and width of each patch. For each position $p \in \{1, \dots, P\}$, the pixels at that position are modeled by the Transformer to obtain $X_G \in \mathbb{R}^{P \times N \times d}$, as in the following equation: $X_G(p) = \mathrm{Transformer}(X_U(p)),\ 1 \le p \le P$.
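The partitioning described above can be sketched in a few lines of PyTorch. The sketch below assumes a MaxViT-style formulation; the window size P and grid size G are illustrative, and `torch.nn.MultiheadAttention` is used only as a stand-in for RelAttention (it omits the relative position bias).

```python
# Minimal sketch of the Block/Unblock and Grid/Ungrid partitions: window and
# grid attention differ only in which axis becomes the token dimension.
import torch

def block(x, P):
    """(B, H, W, C) -> (B * H//P * W//P, P*P, C): non-overlapping P x P windows."""
    B, H, W, C = x.shape
    x = x.view(B, H // P, P, W // P, P, C)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(-1, P * P, C)

def unblock(x, B, H, W, P):
    """Inverse of block(): windows back to (B, H, W, C)."""
    C = x.shape[-1]
    x = x.view(B, H // P, W // P, P, P, C)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(B, H, W, C)

def grid(x, G):
    """(B, H, W, C) -> (B * H//G * W//G, G*G, C): a uniform G x G grid whose
    tokens are spread over the whole map (adaptive window of H//G x W//G)."""
    B, H, W, C = x.shape
    x = x.view(B, G, H // G, G, W // G, C)
    # the extra transpose places the grid dimension on the token (spatial) axis
    return x.permute(0, 2, 4, 1, 3, 5).reshape(-1, G * G, C)

def ungrid(x, B, H, W, G):
    """Inverse of grid(): return the tokens to their 2D positions."""
    C = x.shape[-1]
    x = x.view(B, H // G, W // G, G, G, C)
    return x.permute(0, 3, 1, 4, 2, 5).reshape(B, H, W, C)

# Usage sketch (MultiheadAttention stands in for RelAttention):
# attn = torch.nn.MultiheadAttention(embed_dim=64, num_heads=4, batch_first=True)
# x = torch.randn(2, 56, 56, 64)
# w = block(x, 7); w, _ = attn(w, w, w); x = unblock(w, 2, 56, 56, 7)   # Block-SA
# g = grid(x, 7);  g, _ = attn(g, g, g); x = ungrid(g, 2, 56, 56, 7)    # Grid-SA
```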
3. Results
3.1. Implementation Details and Performance Evaluation Metrics
- Accuracy: This indicates the overall correctness of the prediction results: the number of correctly predicted samples divided by the total number of samples, as in the following equation: $\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$
- Precision: This indicates the proportion of samples predicted to be positive that are truly positive, as in the following equation: $\mathrm{Precision} = \frac{TP}{TP + FP}$
- Recall: Also known as sensitivity, this indicates the proportion of originally positive samples that are ultimately predicted as positive, as in the following equation: $\mathrm{Recall} = \frac{TP}{TP + FN}$
- F1-Score: This represents the harmonic mean of precision and recall, balancing the two, with a maximum value of 1 and a minimum value of 0, as in the following equation: $F1 = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$
- Specificity: This describes the proportion of all negative samples that are correctly identified as negative, as in the following equation: $\mathrm{Specificity} = \frac{TN}{TN + FP}$ (a code sketch computing these metrics follows this list).
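For reference, a small helper (illustrative names, not from the paper's code) that computes the five metrics above from one-vs-rest confusion counts:

```python
# Compute accuracy, precision, recall, specificity, and F1 from per-class
# confusion counts (TP, FP, TN, FN), guarding against division by zero.
def classification_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    accuracy    = (tp + tn) / (tp + tn + fp + fn)
    precision   = tp / (tp + fp) if tp + fp else 0.0
    recall      = tp / (tp + fn) if tp + fn else 0.0   # sensitivity
    specificity = tn / (tn + fp) if tn + fp else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision, "recall": recall,
            "specificity": specificity, "f1": f1}

# Example with hypothetical counts for one class (one-vs-rest):
# print(classification_metrics(tp=480, fp=12, tn=1400, fn=40))
```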
3.2. Experimental Results and Analysis
3.3. Ablation Experiments
3.4. Model Evaluation and Interpretability Analysis
3.4.1. Confusion Matrix
3.4.2. Grad-CAM
4. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Acknowledgments
Conflicts of Interest
References
- Liu, Q.; Fang, M.; Li, Y.; Gao, M. Deep learning based research on quality classification of shiitake mushrooms. LWT 2022, 168, 113902.
- Molina-Castillo, S.; Espinoza-Ortega, A.; Thomé-Ortiz, H.; Moctezuma-Pérez, S. Gastronomic diversity of wild edible mushrooms in the Mexican cuisine. Int. J. Gastron. Food Sci. 2023, 31, 100652.
- Ford, W.W. A new classification of mycetismus (mushroom poisoning). J. Pharmacol. Exp. Ther. 1926, 29, 305–309.
- Tutuncu, K.; Cinar, I.; Kursun, R.; Koklu, M. Edible and poisonous mushrooms classification by machine learning algorithms. In Proceedings of the 2022 11th Mediterranean Conference on Embedded Computing (MECO), Budva, Montenegro, 7–10 June 2022; pp. 1–4.
- Abdulnabi, A.H.; Wang, G.; Lu, J.; Jia, K. Multi-task CNN model for attribute prediction. IEEE Trans. Multimed. 2015, 17, 1949–1959.
- Han, K.; Xiao, A.; Wu, E.; Guo, J.; Xu, C.; Wang, Y. Transformer in transformer. Adv. Neural Inf. Process. Syst. 2021, 34, 15908–15919.
- Krizhevsky, A.; Sutskever, I.; Hinton, G.E. Imagenet classification with deep convolutional neural networks. Commun. ACM 2017, 60, 84–90.
- Guo, Y.; Zheng, Y.; Tan, M.; Chen, Q.; Li, Z.; Chen, J.; Zhao, P.; Huang, J. Towards accurate and compact architectures via neural architecture transformer. IEEE Trans. Pattern Anal. Mach. Intell. 2021, 44, 6501–6516.
- Simonyan, K.; Zisserman, A. Two-stream convolutional networks for action recognition in videos. In Proceedings of the Advances in Neural Information Processing Systems, Montreal, QC, Canada, 8–13 December 2014; Volume 27.
- He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778.
- Howard, A.G.; Zhu, M.; Chen, B.; Kalenichenko, D.; Wang, W.; Weyand, T.; Andreetto, M.; Adam, H. Mobilenets: Efficient convolutional neural networks for mobile vision applications. arXiv 2017, arXiv:1704.04861.
- Tan, M.; Le, Q. Efficientnet: Rethinking model scaling for convolutional neural networks. In Proceedings of the International Conference on Machine Learning, Long Beach, CA, USA, 10–15 June 2019; pp. 6105–6114.
- Radosavovic, I.; Kosaraju, R.P.; Girshick, R.; He, K.; Dollár, P. Designing network design spaces. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 10428–10436.
- Szegedy, C.; Liu, W.; Jia, Y.; Sermanet, P.; Reed, S.; Anguelov, D.; Erhan, D.; Vanhoucke, V.; Rabinovich, A. Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 1–9.
- Kang, E.; Han, Y.; Oh, I.S. Mushroom image recognition using convolutional neural network and transfer learning. KIISE Trans. Comput. Pract. 2018, 24, 53–57.
- Xiao, J.; Zhao, C.; Li, X.; Liu, Z.; Pang, B.; Yang, Y.; Wang, J. Research on mushroom image classification based on deep learning. Softw. Eng. 2020, 23, 21–26.
- Shen, R.; Huang, Y.; Wen, X.; Zhang, L. Mushroom classification based on Xception and ResNet50 models. J. Heihe Univ. 2020, 11, 181–184.
- Shuaichang, F.; Xiaomei, Y.; Jian, L. Toadstool image recognition based on deep residual network and transfer learning. J. Transduct. Technol. 2020, 33, 74–83.
- Yuan, P.; Shen, C.; Xu, H. Fine-grained mushroom phenotype recognition based on transfer learning and bilinear CNN. Trans. Chin. Soc. Agric. Mach. 2021, 52, 151–158.
- Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. In Proceedings of the Advances in Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017; Volume 30.
- Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An image is worth 16 × 16 words: Transformers for image recognition at scale. arXiv 2020, arXiv:2010.11929.
- Touvron, H.; Cord, M.; Douze, M.; Massa, F.; Sablayrolles, A.; Jégou, H. Training data-efficient image transformers & distillation through attention. In Proceedings of the International Conference on Machine Learning, Virtual, 18–24 July 2021; pp. 10347–10357.
- Dong, X.; Bao, J.; Chen, D.; Zhang, W.; Yu, N.; Yuan, L.; Chen, D.; Guo, B. Cswin transformer: A general vision transformer backbone with cross-shaped windows. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 12124–12134.
- Yang, J.; Li, C.; Zhang, P.; Dai, X.; Xiao, B.; Yuan, L.; Gao, J. Focal self-attention for local-global interactions in vision transformers. arXiv 2021, arXiv:2107.00641.
- Xu, R.; Tu, Z.; Xiang, H.; Shao, W.; Zhou, B.; Ma, J. CoBEVT: Cooperative bird’s eye view semantic segmentation with sparse transformers. arXiv 2022, arXiv:2207.02202.
- Yuan, L.; Chen, Y.; Wang, T.; Yu, W.; Shi, Y.; Jiang, Z.H.; Tay, F.E.; Feng, J.; Yan, S. Tokens-to-token vit: Training vision transformers from scratch on imagenet. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 10–17 October 2021; pp. 558–567.
- Li, Y.; Wu, C.Y.; Fan, H.; Mangalam, K.; Xiong, B.; Malik, J.; Feichtenhofer, C. Improved multiscale vision transformers for classification and detection. arXiv 2021, arXiv:2112.01526.
- Wang, W.; Xie, E.; Li, X.; Fan, D.P.; Song, K.; Liang, D.; Lu, T.; Luo, P.; Shao, L. Pyramid vision transformer: A versatile backbone for dense prediction without convolutions. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 10–17 October 2021; pp. 568–578.
- Xu, R.; Xiang, H.; Tu, Z.; Xia, X.; Yang, M.H.; Ma, J. V2X-ViT: Vehicle-to-everything cooperative perception with vision transformer. In Proceedings of the Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, 23–27 October 2022; pp. 107–124.
- Bello, I.; Fedus, W.; Du, X.; Cubuk, E.D.; Srinivas, A.; Lin, T.Y.; Shlens, J.; Zoph, B. Revisiting resnets: Improved training and scaling strategies. Adv. Neural Inf. Process. Syst. 2021, 34, 22614–22627.
- Touvron, H.; Cord, M.; Sablayrolles, A.; Synnaeve, G.; Jégou, H. Going deeper with image transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 10–17 October 2021; pp. 32–42.
- Zhou, D.; Kang, B.; Jin, X.; Yang, L.; Lian, X.; Jiang, Z.; Hou, Q.; Feng, J. Deepvit: Towards deeper vision transformer. arXiv 2021, arXiv:2103.11886.
- Wang, B. Automatic Mushroom Species Classification Model for Foodborne Disease Prevention Based on Vision Transformer. J. Food Qual. 2022, 2022, 1173102.
- Tan, M.; Le, Q. Efficientnetv2: Smaller models and faster training. In Proceedings of the International Conference on Machine Learning, Virtual, 18–24 July 2021; pp. 10096–10106.
- Brock, A.; De, S.; Smith, S.L.; Simonyan, K. High-performance large-scale image recognition without normalization. In Proceedings of the International Conference on Machine Learning, Virtual, 18–24 July 2021; pp. 1059–1071.
- Mehta, S.; Rastegari, M. Mobilevit: Light-weight, general-purpose, and mobile-friendly vision transformer. arXiv 2021, arXiv:2110.02178.
- Tu, Z.; Talebi, H.; Zhang, H.; Yang, F.; Milanfar, P.; Bovik, A.; Li, Y. Maxvit: Multi-axis vision transformer. In Proceedings of the Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, 23–27 October 2022; pp. 459–479.
- Kiss, N.; Czúni, L. Mushroom image classification with CNNs: A case-study of different learning strategies. In Proceedings of the 2021 12th International Symposium on Image and Signal Processing and Analysis (ISPA), Zagreb, Croatia, 13–15 September 2021; pp. 165–170.
- Zhuang, F.; Qi, Z.; Duan, K.; Xi, D.; Zhu, Y.; Zhu, H.; Xiong, H.; He, Q. A comprehensive survey on transfer learning. Proc. IEEE 2020, 109, 43–76.
- Liu, H.; Dai, Z.; So, D.; Le, Q.V. Pay attention to mlps. Adv. Neural Inf. Process. Syst. 2021, 34, 9204–9215.
- Chu, X.; Tian, Z.; Zhang, B.; Wang, X.; Wei, X.; Xia, H.; Shen, C. Conditional positional encodings for vision transformers. arXiv 2021, arXiv:2102.10882.
- Liu, Z.; Mao, H.; Wu, C.Y.; Feichtenhofer, C.; Darrell, T.; Xie, S. A convnet for the 2020s. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 11976–11986.
- Selvaraju, R.R.; Cogswell, M.; Das, A.; Vedantam, R.; Parikh, D.; Batra, D. Grad-cam: Visual explanations from deep networks via gradient-based localization. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 618–626.
Method | Epoch | Batch Size | MO106 Accuracy (%) | Mushroom Accuracy (%) | MO106 Loss | Mushroom Loss
---|---|---|---|---|---|---
ResNet34 | 300 | 32 | 87.44 | 75.0 | 0.43 | 0.178
MobileNet-V2 | 300 | 32 | 76.49 | 65.2 | 1.37 | 0.944
EfficientNet | 300 | 32 | 89.91 | 84.9 | 0.07 | 0.126
ConvNeXt | 300 | 32 | 88.23 | 80.24 | 1.056 | 1.079
Transformer | 300 | 32 | 90.90 | 78.25 | 0.301 | 1.027
Swin-Transformer | 300 | 32 | 88.47 | 76.31 | 0.622 | 1.512
ShuffleNet | 300 | 32 | 89.54 | 83.5 | 0.12 | 0.164
Backbone | 300 | 32 | 92.58 | 86.11 | 0.603 | 0.90
Our method | 300 | 32 | 96.21 | 91.83 | 1.053 | 0.871
Class | Precision (%) | Recall (%) | Specificity (%) | F1-Score (%) |
---|---|---|---|---|
Conditionally_edible | 93.2 | 82.1 | 99.7 | 87.3 |
Deadly | 90.4 | 74.8 | 98.7 | 81.9 |
Edible | 97.6 | 92.3 | 99.1 | 94.9 |
Poisonous | 89.3 | 96.8 | 86.8 | 92.9 |
Model | No. | Layer2 | Layer3 | Layer4 | MO106 Accuracy (%) | Mushroom Accuracy (%)
---|---|---|---|---|---|---
Backbone | – | × | × | × | 92.58 | 86.11
Ours | 1 | √ | × | × | 94.74 | 90.13
Ours | 2 | × | √ | × | 94.69 | 90.2
Ours | 3 | × | × | √ | 95.04 | 90.34
Ours | 4 | √ | √ | × | 87.53 | 79.54
Ours | 5 | × | √ | √ | 89.41 | 82.63
Ours | 6 | √ | √ | √ | 83.59 | 75.6
Block | MO106 Accuracy (%) | Mushroom Accuracy (%)
---|---|---
MLP | 95.04 | 90.34
gMLP | 96.15 | 91.42
GluMLP | 87.11 | 90.00
SE | MO106 Accuracy (%) | Mushroom Accuracy (%)
---|---|---
√ | 95.47 | 91.37
× | 96.15 | 91.42
ASPP | MO106 Accuracy (%) | Mushroom Accuracy (%)
---|---|---
√ | 96.21 | 91.83
× | 96.15 | 91.42