Multimodal Fine-Grained Transformer Model for Pest Recognition
Abstract
1. Introduction
- A multimodal fine-grained transformer (MMFGT) model for pest recognition is proposed. The MMFGT extends the ViT model with self-supervised learning and fine-grained recognition to address two difficulties in pest recognition: the target typically occupies only a small part of the image, and conventional models require large labeled datasets for training;
- Compared with existing methods, the proposed MMFGT model provides a better-performing classification strategy for pest recognition by jointly exploiting multimodal information from images and natural-language descriptions (a schematic sketch follows this list);
- Extensive experimental results demonstrate that the MMFGT obtains more competitive results than existing image recognition models in the pest recognition task. The MMFGT reached accuracies of 98.12% on the IDADP pest dataset and 95.83% on the tomato pest dataset, 5.92 and 4.16 percentage points higher, respectively, than the state-of-the-art DINO baseline.
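To make the joint image-text design concrete, the following is a minimal sketch of the kind of two-tower fusion classifier the contributions above describe: a ViT-style image encoder and an ALBERT-style text encoder whose embeddings are concatenated and fed to a classification head. The module interfaces, embedding dimensions, concatenation-based fusion, and the 29-class output (matching the IDADP table later in this paper) are illustrative assumptions, not the authors' exact architecture.

```python
# Illustrative sketch of image-text fusion for pest classification.
# Encoder interfaces, dimensions, and concatenation fusion are assumptions;
# they are not the paper's exact MMFGT architecture.
import torch
import torch.nn as nn

class MultimodalPestClassifier(nn.Module):
    def __init__(self, image_encoder: nn.Module, text_encoder: nn.Module,
                 img_dim: int = 768, txt_dim: int = 768, num_classes: int = 29):
        super().__init__()
        self.image_encoder = image_encoder  # e.g., a ViT backbone returning [B, img_dim]
        self.text_encoder = text_encoder    # e.g., an ALBERT backbone returning [B, txt_dim]
        self.classifier = nn.Sequential(
            nn.Linear(img_dim + txt_dim, 512),
            nn.ReLU(),
            nn.Linear(512, num_classes),
        )

    def forward(self, images: torch.Tensor, input_ids: torch.Tensor,
                attention_mask: torch.Tensor) -> torch.Tensor:
        img_feat = self.image_encoder(images)                    # [B, img_dim]
        txt_feat = self.text_encoder(input_ids, attention_mask)  # [B, txt_dim]
        fused = torch.cat([img_feat, txt_feat], dim=-1)          # joint representation
        return self.classifier(fused)                            # class logits
```

In the actual MMFGT, the image branch is a self-supervised ViT refined with a part selection module (Section 3.2.1) and the text branch is ALBERT (Section 3.2.2); the fusion head above is only a placeholder.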
2. Related Work
2.1. Image Classification
2.2. Transformer
2.3. Self-Supervised Learning
2.4. Multimodal Learning
3. Materials and Methods
3.1. Datasets
3.2. The Structure of the Proposed Model
3.2.1. Image Encoder
Self-Supervised Learning Architecture
Improvement of the ViT Architecture with a PSM
3.2.2. Text Encoder
4. Experiments and Discussion
4.1. Experimental Settings
4.2. Evaluation Metrics
4.3. Baseline
4.4. Experimental Results
4.5. Generalization Performance of the Model
4.6. Visual Analysis
4.7. Ablation Experiments
4.8. Discussion
5. Conclusions
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
Abbreviations
Abbreviation | Definition
---|---
MMFGT | multimodal fine-grained transformer model
CNN | convolutional neural network
ViT | vision transformer
IDADP | Image Database for Agricultural Diseases and Pests Research
FGT | fine-grained transformer model
BERT | bidirectional encoder representations from transformers
ALBERT | a lite BERT
PSM | part selection module
SwinT | hierarchical vision transformer using shifted windows
DINO | a form of self-distillation with no labels
EsViT | efficient self-supervised vision transformer
Pest categories and image counts in the IDADP pest dataset:

Pest Name | Number of Images
---|---
Colposcelis signata | 41 |
Piezodorus rubrofasciatus | 19 |
Riptortus pedestris | 54 |
Eysarcoris guttiger | 6
Erthesina fullo | 40 |
Membracidae | 41 |
Acrida cinerea | 13 |
Tingidae | 18 |
Oxya | 10 |
Scutelleridae | 9
Spoladea recurvalis | 38 |
Cletus schmidti Kiritshenko | 40 |
Ascotis selenaria Schiffermuller et Denis | 36 |
Helicoverpa armigera | 39 |
Berytidae | 79 |
Taiwania | 25 |
Aphidoidea | 124 |
Eurygaster testudinarius | 19 |
Spodoptera frugiperda | 95 |
Trigonotylus ruficornis Geoffroy | 28 |
Riptortus linearis Fabricius | 65 |
Rhopalosiphum maidis | 19 |
Pygmy sand cricket | 17 |
Atractomorpha sinensis Bolivar | 90 |
Tropidothorax elegans Distant | 30 |
Cletus punctiger Dallas | 93 |
Dolycoris baccarum | 120 |
Nysius ericae | 62 |
Longhorned grasshoppers | 23 |
Examples of pest images paired with their natural-language description texts (the images themselves are not reproduced in this version):

Pest Name | Image | Description Text
---|---|---
Aphidoidea | (image omitted) | Body length 2 mm, green, ovoid. Eyes large, small ocular surface. Ventral tube tubular, apical margin protruding, surface smooth or tiled. Tail plate end round
Acrida cinerea | (image omitted) | Body medium to large, green in color. Head conical. Face extremely inclined, face bulge extremely narrow. Head protruding with rounded apex. Antennae sword-shaped. Compound eyes long-oval
Atractomorpha sinensis Bolivar | (image omitted) | The body is green. Head sharpened, protruding forward, with small yellow tuberculate projections on lateral margins. Forewings green, exceeding the abdomen; hindwings red at the base and light green at the tip
Membracidae | (image omitted) | The body is yellowish brown, narrow and long, with dense carving points. The top of the head and the anterior margin of the dorsal plate of the prothorax are dotted with small black grains. Compound eyes maroon, single eye red. The lateral horns are elongated and black at the end
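Section 3.2.2 pairs each pest image with a description text such as those above through an ALBERT-based text encoder. As a hedged sketch of how one description could be embedded, assuming the HuggingFace transformers library and the albert-base-v2 checkpoint (both illustrative choices; the paper only states that ALBERT is used):

```python
# Hedged sketch: embedding a pest description text with ALBERT.
# The transformers library and albert-base-v2 checkpoint are assumptions.
import torch
from transformers import AlbertModel, AlbertTokenizer

tokenizer = AlbertTokenizer.from_pretrained("albert-base-v2")
model = AlbertModel.from_pretrained("albert-base-v2")

description = ("Body length 2 mm, green, ovoid. Eyes large, small ocular "
               "surface. Ventral tube tubular, apical margin protruding.")
inputs = tokenizer(description, return_tensors="pt", truncation=True)
with torch.no_grad():
    outputs = model(**inputs)

text_embedding = outputs.last_hidden_state[:, 0, :]  # first-token embedding, shape [1, 768]
print(text_embedding.shape)
```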
Pest categories and image counts in the tomato pest dataset (example images omitted in this version):

Pest Name | Number of Images
---|---
Bemisia argentifolii | 61
Spodoptera litura | 97
Helicoverpa armigera | 109
Thrips palmi | 25
Myzus persicae | 131
Tetranychus urticae | 75
Spodoptera exigua | 76
Zeugodacus cucurbitae | 43
Recognition results on the IDADP pest dataset:

Model | Accuracy (%) | Precision (%) | Recall (%) | F1-Score (%)
---|---|---|---|---
ResNet101 | 68.90 | 63.79 | 56.37 | 52.43 |
ViT | 86.33 | 86.45 | 83.50 | 84.33 |
SwinT | 87.67 | 87.98 | 85.05 | 84.68 |
DINO | 92.20 | 93.29 | 91.16 | 92.07 |
EsViT | 91.20 | 91.85 | 91.05 | 91.23 |
FGT | 97.32 | 98.67 | 96.99 | 97.51 |
MMFGT | 98.12 | 99.07 | 98.56 | 98.50 |
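The accuracy, precision, recall, and F1-score columns in the tables above and below (Section 4.2) are standard classification metrics. As a hedged illustration of how such macro-averaged values are typically computed from per-class predictions, the sketch below uses scikit-learn; the paper's own evaluation tooling is an assumption here.

```python
# Hedged sketch: computing accuracy/precision/recall/F1 from predictions.
# scikit-learn is an assumed tool, not one named by the paper.
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

def classification_metrics(y_true, y_pred):
    """Return (accuracy, precision, recall, f1) in percent, macro-averaged."""
    acc = accuracy_score(y_true, y_pred)
    p, r, f1, _ = precision_recall_fscore_support(
        y_true, y_pred, average="macro", zero_division=0
    )
    return tuple(round(100 * m, 2) for m in (acc, p, r, f1))

# Example: six samples over three classes.
print(classification_metrics([0, 1, 2, 2, 1, 0], [0, 1, 2, 1, 1, 0]))
```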
Model | Accuracy (%) | Precision (%) | Recall (%) | F1-Score (%)
---|---|---|---|---
ResNet101 | 74.48 | 67.83 | 62.90 | 63.59 |
ViT | 89.58 | 79.39 | 74.42 | 75.36 |
SwinT | 84.90 | 73.84 | 73.08 | 73.06 |
DINO | 92.19 | 92.89 | 91.66 | 92.07 |
EsViT | 91.67 | 92.32 | 91.97 | 91.84 |
FGT | 96.13 | 98.42 | 96.11 | 97.16 |
MMFGT | 97.40 | 98.96 | 96.85 | 97.20 |
Model | Accuracy (%) | Precision (%) | Recall (%) | F1-Score (%)
---|---|---|---|---
ResNet101 | 72.93 | 66.37 | 65.65 | 64.53 |
ViT | 85.64 | 85.43 | 85.25 | 84.72 |
SwinT | 84.53 | 85.15 | 84.78 | 84.37 |
DINO | 92.27 | 92.45 | 91.97 | 91.48 |
EsViT | 92.82 | 93.01 | 91.95 | 91.82 |
FGT | 96.13 | 98.31 | 96.37 | 97.49 |
MMFGT | 100.00 | 100.00 | 100.00 | 100.00 |
Recognition results on the tomato pest dataset:

Model | Accuracy (%) | Precision (%) | Recall (%) | F1-Score (%)
---|---|---|---|---
ResNet101 | 76.19 | 63.50 | 67.59 | 64.57 |
ViT | 83.93 | 90.78 | 77.00 | 76.58 |
SwinT | 85.12 | 86.29 | 85.42 | 84.03 |
DINO | 91.67 | 91.88 | 91.38 | 90.29 |
EsViT | 91.07 | 91.67 | 90.82 | 90.04 |
FGT | 95.24 | 96.23 | 93.05 | 94.28 |
MMFGT | 95.83 | 96.12 | 96.24 | 96.16 |
Ablation comparison of model components (✓ indicates the component is present):

Method | Attention Mechanism | Self-Supervised Learning | Fine-Grained Mechanism | Text | IDADP Accuracy (%) | Tomato Accuracy (%)
---|---|---|---|---|---|---
ResNet101 | | | | | 68.90 | 76.19
ViT | ✓ | | | | 86.33 | 83.93
SwinT | ✓ | | | | 87.67 | 85.12
DINO | ✓ | ✓ | | | 92.20 | 91.67
EsViT | ✓ | ✓ | | | 91.20 | 91.07
FGT | ✓ | ✓ | ✓ | | 97.32 | 95.24
MMFGT | ✓ | ✓ | ✓ | ✓ | 98.12 | 95.83
Comparison of training time, inference time, and model size:

Model | Training Time (s) | Inference Time (ms) | Parameters (M) | Accuracy (%)
---|---|---|---|---
ViT | 2067 | 5.78 | 86 | 84.33 |
SwinT | 2432 | 6.95 | 88 | 84.68 |
DINO | 2217 | 6.75 | 85 | 92.07 |
EsViT | 2864 | 7.64 | 87 | 91.23 |
FGT | 3164 | 8.81 | 85 | 97.51 |
MMFGT | 3595 | 11.44 | 97 | 98.50 |