Deep Modular Bilinear Attention Network for Visual Question Answering
Abstract
1. Introduction
- In this paper, we propose a deep modular bilinear attention network (DMBA-NET) framework with two basic attention units (BAN-GA and BAN-SA) that model the inter-modality and intra-modality relations between visual and language features. BAN-GA and BAN-SA are the core of the framework and can be cascaded in depth. Unlike other models, we compute both inter-modality and intra-modality attention with bilinear attention rather than dot-product attention; our experiments show that this yields richer and more refined features (a minimal sketch of such a bilinear attention unit is given after this list).
- We encode the question with the dynamic word vectors of BERT. We then apply multi-head self-attention to the text features and sum the result with the features obtained in the previous step before the final classification, which further improves the model's accuracy and indicates that these components are complementary (see the second sketch after this list).
- We visualize the model's attention and the experimental results, which helps to better understand the interaction between multimodal features. Extensive ablation experiments show that each module in the model contributes to the overall performance.
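To make the bilinear attention units concrete, below is a minimal PyTorch sketch of a low-rank (Hadamard-product) bilinear attention map in the spirit of BAN [10]. BAN-GA would apply it between question tokens and image regions, and BAN-SA within a single modality by passing the same features as both inputs. The class name, hidden size, glimpse count, and pooling are illustrative assumptions rather than the exact implementation used in the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BilinearAttention(nn.Module):
    """Low-rank bilinear attention between two sets of features (sketch).
    Guided attention (BAN-GA): x = question tokens, y = image regions.
    Self-attention (BAN-SA): pass the same features as both x and y."""

    def __init__(self, x_dim, y_dim, hid_dim=512, glimpses=4):
        super().__init__()
        self.x_proj = nn.Linear(x_dim, hid_dim)   # U in the low-rank factorization
        self.y_proj = nn.Linear(y_dim, hid_dim)   # V in the low-rank factorization
        self.p = nn.Linear(hid_dim, glimpses)     # one attention map per glimpse

    def forward(self, x, y):
        # x: (B, n, x_dim), y: (B, m, y_dim)
        x_ = torch.relu(self.x_proj(x))                     # (B, n, h)
        y_ = torch.relu(self.y_proj(y))                     # (B, m, h)
        joint = x_.unsqueeze(2) * y_.unsqueeze(1)           # Hadamard product, (B, n, m, h)
        logits = self.p(joint)                              # (B, n, m, g)
        b, n, m, g = logits.shape
        att = F.softmax(logits.view(b, n * m, g), dim=1).view(b, n, m, g)
        # one fused vector per glimpse: attention-weighted sum over all (x, y) pairs
        fused = torch.einsum('bnmg,bnmh->bgh', att, joint)  # (B, g, h)
        return att, fused

# Hypothetical shapes: 14 BERT tokens (768-d) and 36 Faster R-CNN regions (2048-d).
q = torch.randn(2, 14, 768)
v = torch.randn(2, 36, 2048)
ban_ga = BilinearAttention(x_dim=768, y_dim=2048)
att_map, inter = ban_ga(q, v)            # att_map: (2, 14, 36, 4), inter: (2, 4, 512)
ban_sa = BilinearAttention(x_dim=2048, y_dim=2048)
_, intra = ban_sa(v, v)                  # intra-modality (visual) attention
```

Because the unit maps two feature sets to an attention map and a fused representation, several such units can be stacked, which is what cascading the attention in depth amounts to.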
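The question self-attention (Q-SA) path and the summation-based fusion before classification could look like the following sketch. It assumes contextual token vectors from a pretrained BERT encoder and uses PyTorch's built-in nn.MultiheadAttention; the mean pooling, hidden sizes, and answer-vocabulary size (3129 is a common choice for VQA 2.0) are assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class QuestionSelfAttention(nn.Module):
    """Multi-head self-attention over BERT token features (Q-SA),
    pooled into a single question vector (sketch)."""

    def __init__(self, dim=768, heads=8):
        super().__init__()
        self.mhsa = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, tokens, pad_mask=None):
        # tokens: (B, n, dim) contextual word vectors from BERT
        attended, _ = self.mhsa(tokens, tokens, tokens, key_padding_mask=pad_mask)
        return self.norm(tokens + attended).mean(dim=1)   # residual, norm, mean-pool -> (B, dim)

class AnswerHead(nn.Module):
    """Sums the Q-SA question vector with the fused multimodal vector
    from the bilinear attention stack before classification (sketch)."""

    def __init__(self, fused_dim=512, dim=768, num_answers=3129):
        super().__init__()
        self.proj = nn.Linear(fused_dim, dim)
        self.classifier = nn.Sequential(
            nn.Linear(dim, 2 * dim), nn.ReLU(), nn.Linear(2 * dim, num_answers))

    def forward(self, fused, q_sa):
        # fused: (B, fused_dim), q_sa: (B, dim) -> answer logits (B, num_answers)
        return self.classifier(self.proj(fused) + q_sa)

# Example with hypothetical shapes.
tokens = torch.randn(2, 14, 768)   # BERT token features for the question
fused = torch.randn(2, 512)        # e.g., glimpses summed from the bilinear attention stack
logits = AnswerHead()(fused, QuestionSelfAttention()(tokens))
```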
2. Related Work
2.1. Attention
2.2. High-Level Attributes and Knowledge
2.3. VQA Pre-Training
2.4. Feature Fusion
3. Deep Modular Bilinear Attention Network
3.1. Question and Image Encoding
3.2. Multi-Glimpse Bilinear Guided-Attention Network
3.3. Multi-Glimpse Bilinear Self-Attention Network
3.4. Multi-Head Self-Attention
3.5. Feature Fusion and Answer Prediction
3.6. Loss Function
4. Experiment
4.1. Datasets
4.2. Experimental Setup
4.3. Ablation Analysis
- BAN-GA [10]: denotes the Bilinear Guided-Attention Network.
- BAN-GA + BERT: the Bilinear Guided-Attention Network with BERT; we use BERT to encode the question features.
- BAN-GA + BAN-SA: the Bilinear Guided-Attention Network combined with the Bilinear Self-Attention Network.
- BAN-GA + Q-SA: the Bilinear Guided-Attention Network combined with the Question Self-Attention Network.
- BAN-GA + BERT + BAN-SA + Q-SA: our final model.
4.4. Qualitative Analysis
4.5. Comparison with the State-of-the-Art
5. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
References
- Antol, S.; Agrawal, A.; Lu, J.; Mitchell, M.; Batra, D.; Zitnick, C.L.; Parikh, D. VQA: Visual Question Answering. In Proceedings of the International Conference on Computer Vision (ICCV), Santiago, Chile, 7–13 December 2015.
- Fukui, A.; Park, D.H.; Yang, D.; Rohrbach, A.; Darrell, T.; Rohrbach, M. Multimodal Compact Bilinear Pooling for Visual Question Answering and Visual Grounding. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, Austin, TX, USA, 1–5 November 2016.
- Yu, Z.; Yu, J.; Fan, J.; Tao, D. Multi-modal Factorized Bilinear Pooling with Co-attention Learning for Visual Question Answering. In Proceedings of the 2017 IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017.
- Ben-younes, H.; Cadene, R.; Cord, M.; Thome, N. MUTAN: Multimodal Tucker Fusion for Visual Question Answering. In Proceedings of the 2017 IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017.
- Xu, H.; Saenko, K. Ask, Attend and Answer: Exploring Question-Guided Spatial Attention for Visual Question Answering. In Computer Vision—ECCV 2016; Springer International Publishing: Berlin/Heidelberg, Germany, 2016; pp. 451–466.
- Sun, Q.; Fu, Y. Stacked Self-Attention Networks for Visual Question Answering. In Proceedings of the 2019 International Conference on Multimedia Retrieval, Ottawa, ON, Canada, 10–13 June 2019.
- Chowdhury, M.I.H.; Nguyen, K.; Sridharan, S.; Fookes, C. Hierarchical Relational Attention for Video Question Answering. In Proceedings of the 2018 25th IEEE International Conference on Image Processing (ICIP), Athens, Greece, 7–10 October 2018.
- Yu, D.; Fu, J.; Mei, T.; Rui, Y. Multi-level Attention Networks for Visual Question Answering. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017.
- Anderson, P.; He, X.; Buehler, C.; Teney, D.; Johnson, M.; Gould, S.; Zhang, L. Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018.
- Kim, J.-H.; Jun, J.; Zhang, B.-T. Bilinear Attention Networks. In Proceedings of the 32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, QC, Canada, 3–8 December 2018.
- Yu, Z.; Yu, J.; Cui, Y.; Tao, D.; Tian, Q. Deep Modular Co-Attention Networks for Visual Question Answering. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 6281–6290.
- Peng, G.; Jiang, Z.; You, H.; Lu, P.; Hoi, S.; Wang, X.; Li, H. Dynamic Fusion with Intra- and Inter-Modality Attention Flow for Visual Question Answering. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 6639–6648.
- Han, D.; Zhou, S.; Li, K.C.; de Mello, R.F. Cross-Modality Co-Attention Networks for Visual Question Answering. Soft Comput. 2021, 25, 5411–5421.
- Devlin, J.; Chang, M.; Lee, K.; Toutanova, K. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. arXiv 2018, arXiv:1810.04805.
- Wu, Q.; Wang, P.; Shen, C.; Dick, A.; van den Hengel, A. Ask Me Anything: Free-Form Visual Question Answering Based on Knowledge from External Sources. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016.
- Wang, P.; Wu, Q.; Shen, C.; Dick, A.; van den Hengel, A. Explicit Knowledge-Based Reasoning for Visual Question Answering. In Proceedings of the 26th International Joint Conference on Artificial Intelligence, Melbourne, Australia, 19–25 August 2017.
- Wang, P.; Wu, Q.; Shen, C.; Dick, A.; van den Hengel, A. FVQA: Fact-Based Visual Question Answering. IEEE Trans. Pattern Anal. Mach. Intell. 2018, 40, 2413–2427.
- Wu, Q.; Shen, C.; Wang, P.; Dick, A.; van den Hengel, A. Image Captioning and Visual Question Answering Based on Attributes and External Knowledge. IEEE Trans. Pattern Anal. Mach. Intell. 2018, 40, 1367–1381.
- Deng, J.; Dong, W.; Socher, R.; Li, L.-J.; Li, K.; Fei-Fei, L. ImageNet: A Large-Scale Hierarchical Image Database. In Proceedings of the 2009 IEEE Conference on Computer Vision and Pattern Recognition, Miami, FL, USA, 20–25 June 2009.
- Krishna, R.; Zhu, Y.; Groth, O.; Johnson, J.; Hata, K.; Kravitz, J.; Chen, S.; Kalantidis, Y.; Li, F.-F.; Shamma, D.A.; et al. Visual Genome: Connecting Language and Vision Using Crowdsourced Dense Image Annotations. Int. J. Comput. Vis. 2017, 123, 32–73.
- Lu, J.; Batra, D.; Parikh, D.; Lee, S. ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks. arXiv 2019, arXiv:1908.02265.
- Li, L.H.; Yatskar, M.; Yin, D.; Hsieh, C.; Chang, K. VisualBERT: A Simple and Performant Baseline for Vision and Language. arXiv 2019, arXiv:1908.03557.
- Tan, H.; Bansal, M. LXMERT: Learning Cross-Modality Encoder Representations from Transformers. arXiv 2019, arXiv:1908.07490.
- Zhou, L.; Palangi, H.; Zhang, L.; Hu, H.; Corso, J.; Gao, J. Unified Vision-Language Pre-Training for Image Captioning and VQA. In Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA, 7–12 February 2020.
- Yu, Z.; Yu, J.; Xiang, C.; Fan, J.; Tao, D. Beyond Bilinear: Generalized Multimodal Factorized High-Order Pooling for Visual Question Answering. IEEE Trans. Neural Netw. Learn. Syst. 2018, 29, 5947–5959.
- Kim, J.; On, K.; Kim, J.; Ha, J.; Zhang, B. Hadamard Product for Low-Rank Bilinear Pooling. arXiv 2016, arXiv:1610.04325.
- Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 1137–1149.
- Church, K.W. Word2Vec. Nat. Lang. Eng. 2017, 23, 155–162.
- Pennington, J.; Socher, R.; Manning, C.D. GloVe: Global Vectors for Word Representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), Doha, Qatar, 25–29 October 2014; pp. 1532–1543.
- Peters, M.E.; Neumann, M.; Iyyer, M.; Gardner, M.; Clark, C.; Lee, K.; Zettlemoyer, L. Deep Contextualized Word Representations. arXiv 2018, arXiv:1802.05365.
- Radford, A.; Wu, J.; Child, R.; Luan, D.; Amodei, D.; Sutskever, I. Language Models Are Unsupervised Multitask Learners. OpenAI Blog 2019, 1, 9.
- Loshchilov, I.; Hutter, F. Decoupled Weight Decay Regularization. arXiv 2017, arXiv:1711.05101.
- Jiang, Y.; Natarajan, V.; Chen, X.; Rohrbach, M.; Batra, D.; Parikh, D. Pythia v0.1: The Winning Entry to the VQA Challenge 2018. arXiv 2018, arXiv:1807.09956.
- Zhang, Y.; Hare, J.; Prugel-Bennett, A. Learning to Count Objects in Natural Images for Visual Question Answering. arXiv 2018, arXiv:1802.05766.
- Cadene, R.; Ben-Younes, H.; Cord, M.; Thome, N. MuRel: Multimodal Relational Reasoning for Visual Question Answering. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 16–17 June 2019; pp. 1989–1998.
- Peng, L.; Yang, Y.; Wang, Z.; Huang, Z.; Shen, H.T. MRA-Net: Improving VQA via Multi-Modal Relation Attention Network. IEEE Trans. Pattern Anal. Mach. Intell. 2020, 44, 318–329.

Model | Accuracy (%)
---|---
MCAN (Bottom-up image feature) | 67.17
MCAN (Pythia image feature) | 67.44
BAN-GA (Bottom-up image feature) | 66.00
BAN-GA (Pythia image feature) | 67.23

Ablation Model | Accuracy (%)
---|---
BAN-GA | 67.23
BAN-GA + BERT | 68.83
BAN-GA + BAN-SA | 68.81
BAN-GA + Q-SA | 68.90
DMBA-NET (ours) | 69.45

Glimpse (Layer) | Accuracy (%)
---|---
DMBA-NET-1 | 69.20
DMBA-NET-2 | 69.09
DMBA-NET-3 | 69.30
DMBA-NET-4 | 69.45
DMBA-NET-8 | 69.36

Learning Rate | Accuracy (%)
---|---
0.001 | 68.41
0.01 | 69.30
0.02 | 69.45
0.1 | 69.06

All values are accuracies (%).

Method | Test-Dev Y/N | Test-Dev Num | Test-Dev Other | Test-Dev Overall | Test-Std Y/N | Test-Std Num | Test-Std Other | Test-Std Overall
---|---|---|---|---|---|---|---|---
MCB [2] | 82.3 | 37.2 | 57.4 | 65.4 | - | - | - | -
Bottom-Up [9] | 81.82 | 44.21 | 56.05 | 65.32 | 82.20 | 43.90 | 56.26 | 65.67
Counter [34] | 83.14 | 51.62 | 58.97 | 68.09 | 83.56 | 51.39 | 59.11 | 68.41
MuRel [35] | 84.77 | 49.84 | 57.85 | 68.03 | - | - | - | 68.41
MFB+CoAtt+GloVe+VG [3] | 84.1 | 39.1 | 58.4 | 66.9 | 84.2 | 38.1 | 57.8 | 66.6
Pythia v0.1 | - | - | - | 68.71 | - | - | - | -
MFH [25] | 84.27 | 49.56 | 59.89 | 68.76 | - | - | - | -
MRA-NET [36] | 85.58 | 48.92 | 59.46 | 69.02 | 85.83 | 49.22 | 59.86 | 69.46
MCAN [11] | 86.82 | 53.26 | 60.72 | 70.63 | - | - | - | 70.9
DFAF [12] | 86.73 | 52.92 | 61.04 | 70.59 | - | - | - | 70.81
CMCN [13] | 86.27 | 53.86 | 60.57 | 70.39 | - | - | - | 70.66
DMBA-NET (ours, train+val) | 87.55 | 51.15 | 60.72 | 70.69 | 87.81 | 50.26 | 60.79 | 70.85
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
© 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).