LCV2: A Universal Pretraining-Free Framework for Grounded Visual Question Answering
Abstract
1. Introduction
- (i) This study introduced LCV2, a modular framework specifically designed for the VQA grounding task. It eliminates the need for pretraining, reducing the demand for extensive computational power and data resources.
- (ii) LCV2 utilizes an LLM to transform question–answer pair texts into descriptive referring texts suitable for visual grounding, addressing the insufficient interaction between the question-answering and grounding modules in existing modular frameworks (a minimal sketch of this conversion follows this list).
- (iii) The modules within this universal framework are designed to be plug-and-play and compatible with advanced pretrained models, allowing performance to improve dynamically as the underlying models evolve.
- (iv)
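As a hedged illustration of contribution (ii), the sketch below shows how a question–answer pair could be converted into a declarative referring expression with an instruction-tuned LLM such as Flan-T5-large (the LLM listed in the complexity table of this article). The prompt wording and the helper name `qa_to_referring_text` are assumptions for illustration, not the authors' exact implementation.

```python
# Minimal sketch: turn a VQA (question, answer) pair into a referring
# expression that a visual grounding model can consume.
# Assumption: an instruction-tuned LLM (here Flan-T5-large) is prompted
# zero-shot; the paper's exact prompt is not reproduced here.
from transformers import pipeline

generator = pipeline("text2text-generation", model="google/flan-t5-large")

def qa_to_referring_text(question: str, answer: str) -> str:
    """Hypothetical helper: rephrase a Q-A pair as a declarative phrase."""
    prompt = (
        "Rewrite the question and its answer as one short phrase that "
        "describes the object to locate in the image.\n"
        f"Question: {question}\nAnswer: {answer}\nPhrase:"
    )
    return generator(prompt, max_new_tokens=32)[0]["generated_text"].strip()

# Example usage (output will vary with the model):
# qa_to_referring_text("What is the man holding?", "a red umbrella")
# -> e.g. "the red umbrella the man is holding"
```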
2. Related Work
2.1. Visual Question Answering and VQA Grounding
2.2. Open-Vocabulary Object Detection and Referring Expression Comprehension
2.3. Large Language Models
3. Methods
3.1. Problem Definition
3.2. Modular Framework
3.3. Modular Inventory
3.4. Framework Inference
4. Experiments
4.1. Datasets
4.2. Implementation Details
4.3. Evaluation Metrics
4.4. Comparison with Baseline Models
4.5. Impact of the VQA Modules on Results
4.6. LCV2 on the VizWiz Answer Grounding
5. Analysis and Discussion
5.1. LCV2 Compared to Baseline Methods
5.2. Impact of the VQA Module
5.3. LCV2 on the VizWiz Answer Grounding
6. Conclusions and Prospects
7. Limitations
Author Contributions
Funding
Data Availability Statement
Acknowledgments
Conflicts of Interest
Appendix A
Methods | Description | Advantages and Limitations |
---|---|---|
MAC-Caps [7] | Integrating a capsule network module with a query selection mechanism into the MAC VQA system improves the model’s answer grounding capabilities. | Pros: The introduced capsule network module enhances attention to and comprehension of the visual information mentioned in the question. Cons: Visual cues are grounded via attention maps, which are subject to attentional interference and lack fine granularity, limiting the accuracy of visual cue grounding. |
DaVI [5] | The framework is a unified end-to-end system that employs a vision-based language encoder and a language-based vision decoder to generate answers and provide visual evidence based on visual question answering. | Pros: The approach integrates visual understanding and language generation, providing a unified solution that simultaneously addresses visual question answering and the visual evidence grounding of answers, enhancing the model’s overall performance. Cons: The performance of the model depends on large-scale multimodal pretraining data, limiting its effectiveness in specific domains or on small datasets. |
XMETER [60] | By integrating monolingual pretraining and adapter strategies, advanced English visual question answering models are extended to low-resource languages, generating corresponding bounding boxes for key phrases in multilingual questions. | Pros: The model effectively addresses the challenges of visual question answering tasks in low-resource language environments, demonstrating strong adaptability, efficiency, and performance. Cons: For questions involving multi-level logical reasoning (such as relational problems), the model lacks the robust semantic understanding required to fully grasp the task. |
DDTN [10] | A dual-decoder transformer network is proposed, which efficiently predicts language answers and corresponding visual instance grounding in visual question answering by integrating region and grid features and employing an instance query-guided feature processing strategy. | Pros: The model employs a unique dual-decoder design, which facilitates the separate handling of language comprehension and visual grounding tasks. Cons: The model has a limited ability to precisely ground and segment objects with complex contours or against complex backgrounds. |
P2G [61] | The reasoning process is enhanced with multimodal cues by utilizing external expert agents to perform the real-time grounding of key visual and textual objects. | Pros: Based on agent queries of visual or textual cues, the P2G model can perform more purposeful reasoning. Cons: For extremely complex or atypical scenarios, such as densely stacked text or highly abstract visual elements, the model still struggles with accurate understanding and reasoning. |
LCV2 | The framework uses an LLM to connect the VQA and VG modules within a modular design, circumventing the need for extensive pretraining. | Pros: Leveraging the generalizable knowledge of pretrained expert models allows out-of-the-box use without any further training. Cons: The non-end-to-end design may compromise the depth and consistency of cross-modal understanding to some extent. |
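To make the modular, training-free design summarized in the LCV2 row above concrete, the following sketch chains three frozen off-the-shelf components: a VQA model that answers the question, an LLM that rewrites the question–answer pair as a referring expression (as in the sketch in the Introduction), and a grounding model that returns a bounding box. The wrapper interfaces (`VQAModule`, `LLMRephraser`, `GroundingModule`) are hypothetical, not the released code.

```python
# Minimal sketch of a plug-and-play grounded-VQA pipeline in the spirit of
# LCV2: every component is a frozen pretrained model, so no training is run.
# The three Protocol interfaces below are hypothetical wrappers.
from dataclasses import dataclass
from typing import Protocol, Tuple


class VQAModule(Protocol):
    def answer(self, image, question: str) -> str: ...


class LLMRephraser(Protocol):
    def to_referring_text(self, question: str, answer: str) -> str: ...


class GroundingModule(Protocol):
    def locate(self, image, referring_text: str) -> Tuple[float, float, float, float]: ...


@dataclass
class GroundedVQAPipeline:
    vqa: VQAModule
    llm: LLMRephraser
    grounder: GroundingModule

    def __call__(self, image, question: str):
        answer = self.vqa.answer(image, question)              # step 1: VQA module
        phrase = self.llm.to_referring_text(question, answer)  # step 2: LLM bridge
        box = self.grounder.locate(image, phrase)              # step 3: visual grounding
        return answer, box
```

Because each slot only has to satisfy its interface, any of the modules can be swapped for a newer pretrained model without retraining the others, which is the plug-and-play property described above.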
References
- Lu, S.; Liu, M.; Yin, L.; Yin, Z.; Liu, X.; Zheng, W. The multi-modal fusion in visual question answering: A review of attention mechanisms. PeerJ Comput. Sci. 2023, 9, e1400. [Google Scholar] [CrossRef] [PubMed]
- Chen, C.; Anjum, S.; Gurari, D. Grounding answers for visual questions asked by visually impaired people. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 19–24 June 2022; pp. 19098–19107. [Google Scholar]
- Massiceti, D.; Anjum, S.; Gurari, D. VizWiz grand challenge workshop at CVPR 2022. ACM SIGACCESS Access. Comput. 2022, 1. [Google Scholar] [CrossRef]
- Zeng, X.; Wang, Y.; Chiu, T.-Y.; Bhattacharya, N.; Gurari, D. Vision skills needed to answer visual questions. Proc. ACM Hum. Comput. Interact. 2020, 4, 149. [Google Scholar] [CrossRef]
- Liu, Y.; Pan, J.; Wang, Q.; Chen, G.; Nie, W.; Zhang, Y.; Gao, Q.; Hu, Q.; Zhu, P. Weakly-Supervised Grounding for VQA with Dual Visual-Linguistic Interaction. In Proceedings of the CAAI International Conference on Artificial Intelligence, Fuzhou, China, 22–23 July 2023; pp. 156–169. [Google Scholar]
- Xiao, J.; Yao, A.; Li, Y.; Chua, T.S. Can I trust your answer? visually grounded video question answering. arXiv 2023, arXiv:2309.01327. [Google Scholar]
- Urooj, A.; Kuehne, H.; Duarte, K.; Gan, C.; Lobo, N.; Shah, M. Found a reason for me? weakly-supervised grounded visual question answering using capsules. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Virtual, Nashville, TN, USA, 19–25 June 2021; pp. 8465–8474. [Google Scholar]
- Khan, A.U.; Kuehne, H.; Gan, C.; Lobo, N.D.V.; Shah, M. Weakly supervised grounding for VQA in vision-language transformers. In Proceedings of the European Conference on Computer Vision (ECCV), Tel Aviv, Israel, 23–27 October 2022; pp. 652–670. [Google Scholar]
- Le, T.M.; Le, V.; Gupta, S.; Venkatesh, S.; Tran, T. Guiding visual question answering with attention priors. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA, 3–7 January 2023; pp. 4381–4390. [Google Scholar]
- Zhu, L.; Peng, L.; Zhou, W.; Yang, J. Dual-decoder transformer network for answer grounding in visual question answering. Pattern Recogn. Lett. 2023, 171, 53–60. [Google Scholar] [CrossRef]
- Long, J.; Shelhamer, E.; Darrell, T. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 3431–3440. [Google Scholar]
- Sherstinsky, A. Fundamentals of recurrent neural network (RNN) and long short-term memory (LSTM) network. Phys. D 2020, 404, 132306. [Google Scholar] [CrossRef]
- Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. In Proceedings of the 31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA, 4–9 December 2017; Volume 30. [Google Scholar]
- Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv 2020, arXiv:2010.11929. [Google Scholar]
- Malinowski, M.; Fritz, M. A multi-world approach to question answering about real-world scenes based on uncertain input. arXiv 2014, arXiv:1410.0210. [Google Scholar]
- Ren, M.; Kiros, R.; Zemel, R. Image question answering: A visual semantic embedding model and a new dataset. Proc. Adv. Neural Inf. Process. Syst. 2015, 1, 5. [Google Scholar]
- Yu, Z.; Yu, J.; Fan, J.; Tao, D. Multi-modal factorized bilinear pooling with co-attention learning for visual question answering. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 1821–1830. [Google Scholar]
- Ben-Younes, H.; Cadene, R.; Cord, M.; Thome, N. MUTAN: Multimodal tucker fusion for visual question answering. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2612–2620. [Google Scholar]
- Lu, J.; Batra, D.; Parikh, D.; Lee, S. Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. In Proceedings of the 33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, BC, Canada, 8–14 December 2019; Volume 32. [Google Scholar]
- Li, L.H.; Yatskar, M.; Yin, D.; Hsieh, C.-J.; Chang, K.-W. Visualbert: A simple and performant baseline for vision and language. arXiv 2019, arXiv:1908.03557. [Google Scholar]
- Wang, J.; Yang, Z.; Hu, X.; Li, L.; Lin, K.; Gan, Z.; Liu, Z.; Liu, C.; Wang, L. Git: A generative image-to-text transformer for vision and language. arXiv 2022, arXiv:2205.14100. [Google Scholar]
- Li, J.; Li, D.; Xiong, C.; Hoi, S. Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In Proceedings of the International Conference on Machine Learning, Baltimore, MD, USA, 17–23 July 2022; pp. 12888–12900. [Google Scholar]
- Li, J.; Li, D.; Savarese, S.; Hoi, S. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. In Proceedings of the International Conference on Machine Learning, Honolulu, HI, USA, 23–29 July 2023; pp. 19730–19742. [Google Scholar]
- Alayrac, J.-B.; Donahue, J.; Luc, P.; Miech, A.; Barr, I.; Hasson, Y.; Lenc, K.; Mensch, A.; Millican, K.; Reynolds, M. Flamingo: A visual language model for few-shot learning. Adv. Neural Inf. Process. Syst. 2022, 35, 23716–23736. [Google Scholar]
- Zhang, Y.; Zhang, R.; Gu, J.; Zhou, Y.; Lipka, N.; Yang, D.; Sun, T. Llavar: Enhanced visual instruction tuning for text-rich image understanding. arXiv 2023, arXiv:2306.17107. [Google Scholar]
- Zhu, D.; Chen, J.; Shen, X.; Li, X.; Elhoseiny, M. Minigpt-4: Enhancing vision-language understanding with advanced large language models. arXiv 2023, arXiv:2304.10592. [Google Scholar]
- Zareian, A.; Rosa, K.D.; Hu, D.H.; Chang, S.-F. Open-vocabulary object detection using captions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Virtual, Nashville, TN, USA, 19–25 June 2021; pp. 14393–14402. [Google Scholar]
- Radford, A.; Kim, J.W.; Hallacy, C.; Ramesh, A.; Goh, G.; Agarwal, S.; Sastry, G.; Askell, A.; Mishkin, P.; Clark, J. Learning transferable visual models from natural language supervision. In Proceedings of the International Conference on Machine Learning, Virtual, 18–24 July 2021; pp. 8748–8763. [Google Scholar]
- Li, L.H.; Zhang, P.; Zhang, H.; Yang, J.; Li, C.; Zhong, Y.; Wang, L.; Yuan, L.; Zhang, L.; Hwang, J.-N. Grounded language-image pre-training. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 19–24 June 2022; pp. 10965–10975. [Google Scholar]
- Yao, L.; Han, J.; Wen, Y.; Liang, X.; Xu, D.; Zhang, W.; Li, Z.; Xu, C.; Xu, H. Detclip: Dictionary-enriched visual-concept paralleled pre-training for open-world detection. Adv. Neural Inf. Process. Syst. 2022, 35, 9125–9138. [Google Scholar]
- Yu, L.; Lin, Z.; Shen, X.; Yang, J.; Lu, X.; Bansal, M.; Berg, T.L. Mattnet: Modular attention network for referring expression comprehension. In Proceedings of the IEEE conference on computer vision and pattern recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 1307–1315. [Google Scholar]
- Hong, R.; Liu, D.; Mo, X.; He, X.; Zhang, H. Learning to compose and reason with language tree structures for visual grounding. IEEE Trans. Pattern Anal. Mach. Intell. 2019, 44, 684–696. [Google Scholar] [CrossRef]
- Shi, F.; Gao, R.; Huang, W.; Wang, L. Dynamic MDETR: A dynamic multimodal transformer decoder for visual grounding. IEEE Trans. Pattern Anal. Mach. Intell. 2023, 46, 1181–1198. [Google Scholar] [CrossRef] [PubMed]
- Yang, Z.; Chen, T.; Wang, L.; Luo, J. Improving one-stage visual grounding by recursive sub-query construction. In Proceedings of the Computer Vision—ECCV 2020: 16th European Conference, Glasgow, UK, 23–28 August 2020; Proceedings, Part XIV 16. pp. 387–404. [Google Scholar]
- Zhu, C.; Zhou, Y.; Shen, Y.; Luo, G.; Pan, X.; Lin, M.; Chen, C.; Cao, L.; Sun, X.; Ji, R. Seqtr: A simple yet universal network for visual grounding. In Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel, 23–27 October 2022; pp. 598–615. [Google Scholar]
- Subramanian, S.; Merrill, W.; Darrell, T.; Gardner, M.; Singh, S.; Rohrbach, A. Reclip: A strong zero-shot baseline for referring expression comprehension. arXiv 2022, arXiv:2204.05991. [Google Scholar]
- He, R.; Cascante-Bonilla, P.; Yang, Z.; Berg, A.C.; Ordonez, V. Improved Visual Grounding through Self-Consistent Explanations. arXiv 2023, arXiv:2312.04554. [Google Scholar]
- Gan, Z.; Chen, Y.-C.; Li, L.; Zhu, C.; Cheng, Y.; Liu, J. Large-scale adversarial training for vision-and-language representation learning. Adv. Neural Inf. Process. Syst. 2020, 33, 6616–6628. [Google Scholar]
- Chen, Y.-C.; Li, L.; Yu, L.; El Kholy, A.; Ahmed, F.; Gan, Z.; Cheng, Y.; Liu, J. Uniter: Universal image-text representation learning. In Proceedings of the European Conference on Computer Vision, Virtual, Glasgow, UK, 23–28 August 2020; pp. 104–120. [Google Scholar]
- Yan, B.; Jiang, Y.; Wu, J.; Wang, D.; Luo, P.; Yuan, Z.; Lu, H. Universal instance perception as object discovery and retrieval. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 18–22 June 2023; pp. 15325–15336. [Google Scholar]
- Liu, J.; Ding, H.; Cai, Z.; Zhang, Y.; Satzoda, R.K.; Mahadevan, V.; Manmatha, R. Polyformer: Referring image segmentation as sequential polygon generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 18–22 June 2023; pp. 18653–18663. [Google Scholar]
- Xuan, S.; Guo, Q.; Yang, M.; Zhang, S. Pink: Unveiling the power of referential comprehension for multi-modal llms. arXiv 2023, arXiv:2310.00582. [Google Scholar]
- Lu, J.; Clark, C.; Zellers, R.; Mottaghi, R.; Kembhavi, A. Unified-io: A unified model for vision, language, and multi-modal tasks. In Proceedings of the Eleventh International Conference on Learning Representations, Virtual, 25–29 April 2022. [Google Scholar]
- Wang, P.; Yang, A.; Men, R.; Lin, J.; Bai, S.; Li, Z.; Ma, J.; Zhou, C.; Zhou, J.; Yang, H. OFA: Unifying architectures, tasks, and modalities through a simple sequence-to-sequence learning framework. In Proceedings of the International Conference on Machine Learning, Baltimore, MD, USA, 17–23 July 2022; pp. 23318–23340. [Google Scholar]
- Chung, H.W.; Hou, L.; Longpre, S.; Zoph, B.; Tay, Y.; Fedus, W.; Li, Y.; Wang, X.; Dehghani, M.; Brahma, S. Scaling instruction-finetuned language models. arXiv 2022, arXiv:2210.11416. [Google Scholar]
- Brown, T.; Mann, B.; Ryder, N.; Subbiah, M.; Kaplan, J.D.; Dhariwal, P.; Neelakantan, A.; Shyam, P.; Sastry, G.; Askell, A. Language models are few-shot learners. Adv. Neural Inf. Process. Syst. 2020, 33, 1877–1901. [Google Scholar]
- Ye, J.; Chen, X.; Xu, N.; Zu, C.; Shao, Z.; Liu, S.; Cui, Y.; Zhou, Z.; Gong, C.; Shen, Y. A comprehensive capability analysis of gpt-3 and gpt-3.5 series models. arXiv 2023, arXiv:2303.10420. [Google Scholar]
- Achiam, J.; Adler, S.; Agarwal, S.; Ahmad, L.; Akkaya, I.; Aleman, F.L.; Almeida, D.; Altenschmidt, J.; Altman, S.; Anadkat, S. Gpt-4 technical report. arXiv 2023, arXiv:2303.08774. [Google Scholar]
- Touvron, H.; Martin, L.; Stone, K.; Albert, P.; Almahairi, A.; Babaei, Y.; Bashlykov, N.; Batra, S.; Bhargava, P.; Bhosale, S. Llama 2: Open foundation and fine-tuned chat models. arXiv 2023, arXiv:2307.09288. [Google Scholar]
- Sun, Y.; Wang, S.; Feng, S.; Ding, S.; Pang, C.; Shang, J.; Liu, J.; Chen, X.; Zhao, Y.; Lu, Y. Ernie 3.0: Large-scale knowledge enhanced pre-training for language understanding and generation. arXiv 2021, arXiv:2107.02137. [Google Scholar]
- Bai, J.; Bai, S.; Chu, Y.; Cui, Z.; Dang, K.; Deng, X.; Fan, Y.; Ge, W.; Han, Y.; Huang, F. Qwen technical report. arXiv 2023, arXiv:2309.16609. [Google Scholar]
- Chowdhery, A.; Narang, S.; Devlin, J.; Bosma, M.; Mishra, G.; Roberts, A.; Barham, P.; Chung, H.W.; Sutton, C.; Gehrmann, S. Palm: Scaling language modeling with pathways. J. Mach. Learn. Res. 2023, 24, 1–113. [Google Scholar]
- Zeng, A.; Liu, X.; Du, Z.; Wang, Z.; Lai, H.; Ding, M.; Yang, Z.; Xu, Y.; Zheng, W.; Xia, X. Glm-130b: An open bilingual pre-trained model. arXiv 2022, arXiv:2210.02414. [Google Scholar]
- Hudson, D.A.; Manning, C.D. GQA: A new dataset for real-world visual reasoning and compositional question answering. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 16–20 June 2019; pp. 6700–6709. [Google Scholar]
- Johnson, J.; Hariharan, B.; Van Der Maaten, L.; Fei-Fei, L.; Lawrence Zitnick, C.; Girshick, R. Clevr: A diagnostic dataset for compositional language and elementary visual reasoning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 2901–2910. [Google Scholar]
- Devlin, J.; Chang, M.-W.; Lee, K.; Toutanova, K. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv 2018, arXiv:1810.04805. [Google Scholar]
- Chen, C.; Anjum, S.; Gurari, D. VQA Therapy: Exploring Answer Differences by Visually Grounding Answers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 2–6 October 2023; pp. 15315–15325. [Google Scholar]
- Hudson, D.A.; Manning, C.D. Compositional attention networks for machine reasoning. arXiv 2018, arXiv:1803.03067. [Google Scholar]
- Pan, J.; Chen, G.; Liu, Y.; Wang, J.; Bian, C.; Zhu, P.; Zhang, Z. Tell me the evidence? Dual visual-linguistic interaction for answer grounding. arXiv 2022, arXiv:2207.05703. [Google Scholar]
- Wang, Y.; Pfeiffer, J.; Carion, N.; LeCun, Y.; Kamath, A. Adapting Grounded Visual Question Answering Models to Low Resource Languages. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 18–22 June 2023; pp. 2595–2604. [Google Scholar]
- Chen, J.; Liu, Y.; Li, D.; An, X.; Feng, Z.; Zhao, Y.; Xie, Y. Plug-and-Play Grounding of Reasoning in Multimodal Large Language Models. arXiv 2024, arXiv:2403.19322. [Google Scholar]
- Dou, Z.-Y.; Kamath, A.; Gan, Z.; Zhang, P.; Wang, J.; Li, L.; Liu, Z.; Liu, C.; LeCun, Y.; Peng, N. Coarse-to-fine vision-language pre-training with fusion in the backbone. Adv. Neural Inf. Process. Syst. 2022, 35, 32942–32956. [Google Scholar]
- Liu, S.; Zeng, Z.; Ren, T.; Li, F.; Zhang, H.; Yang, J.; Li, C.; Yang, J.; Su, H.; Zhu, J. Grounding dino: Marrying dino with grounded pre-training for open-set object detection. arXiv 2023, arXiv:2303.05499. [Google Scholar]
- Xie, C.; Zhang, Z.; Wu, Y.; Zhu, F.; Zhao, R.; Liang, S. Described Object Detection: Liberating Object Detection with Flexible Expressions. arXiv 2024, arXiv:2307.12813. [Google Scholar]
- Radford, A.; Narasimhan, K.; Salimans, T.; Sutskever, I. Improving Language Understanding by Generative Pre-Training. 2018. Available online: https://www.mikecaptain.com/resources/pdf/GPT-1.pdf (accessed on 16 January 2024).
- Radford, A.; Wu, J.; Child, R.; Luan, D.; Amodei, D.; Sutskever, I. Language models are unsupervised multitask learners. OpenAI Blog 2019, 1, 9. [Google Scholar]
- Raffel, C.; Shazeer, N.; Roberts, A.; Lee, K.; Narang, S.; Matena, M.; Zhou, Y.; Li, W.; Liu, P.J. Exploring the limits of transfer learning with a unified text-to-text transformer. J. Mach. Learn. Res. 2020, 21, 1–67. [Google Scholar]
- Zhang, S.; Roller, S.; Goyal, N.; Artetxe, M.; Chen, M.; Chen, S.; Dewan, C.; Diab, M.; Li, X.; Lin, X.V. Opt: Open pre-trained transformer language models. arXiv 2022, arXiv:2205.01068. [Google Scholar]
- Berrios, W.; Mittal, G.; Thrush, T.; Kiela, D.; Singh, A. Towards language models that can see: Computer vision through the lens of natural language. arXiv 2023, arXiv:2306.16410. [Google Scholar]
- GQA: Visual Reasoning in the Real World—Stanford University. Available online: https://cs.stanford.edu/people/dorarad/gqa/download.html (accessed on 6 January 2024).
- Goyal, Y.; Khot, T.; Summers-Stay, D.; Batra, D.; Parikh, D. Making the v in vqa matter: Elevating the role of image understanding in visual question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 6904–6913. [Google Scholar]
- Answer Grounding for VQA—VizWiz. Available online: https://vizwiz.org/tasks-and-datasets/answer-grounding-for-vqa/ (accessed on 12 January 2024).
- Hu, R.; Andreas, J.; Darrell, T.; Saenko, K. Explainable neural computation via stack neural module networks. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 53–69. [Google Scholar]
- Billa, J.G.; Oh, M.; Du, L. Supervisory Prompt Training. arXiv 2024, arXiv:2403.18051. [Google Scholar]
- De Zarzà, I.; de Curtò, J.; Calafate, C.T. Socratic video understanding on unmanned aerial vehicles. Procedia Comput. Sci. 2023, 225, 144–154. [Google Scholar] [CrossRef]
- Bai, Z.; Wang, R.; Chen, X. Glance and Focus: Memory Prompting for Multi-Event Video Question Answering. arXiv 2024, arXiv:2401.01529. [Google Scholar]
- Wang, X.; Ma, W.; Li, Z.; Kortylewski, A.; Yuille, A.L. 3D-Aware Visual Question Answering about Parts, Poses and Occlusions. arXiv 2024, arXiv:2310.17914. [Google Scholar]
Models | Obj. | Acc. | IoU Precision | IoU Recall | IoU F1 Score | Overlap Precision | Overlap Recall | Overlap F1 Score
---|---|---|---|---|---|---|---|---
MAC [58] | A | 0.571 | 0.009 | 0.045 | 0.015 | 0.056 | 0.274 | 0.093
MAC-Caps [7] | A | 0.551 | 0.023 | 0.119 | 0.039 | 0.120 | 0.626 | 0.201
LCV2 (swin-T) | A | 0.566 | 0.273 | 0.637 | 0.382 | 0.372 | 0.786 | 0.505
LCV2 (swin-B) | A | 0.566 | 0.323 | 0.590 | 0.417 | 0.497 | 0.805 | 0.614
MAC [58] | All | 0.571 | 0.037 | 0.043 | 0.040 | 0.250 | 0.305 | 0.275
MAC-Caps [7] | All | 0.551 | 0.070 | 0.087 | 0.078 | 0.461 | 0.623 | 0.530
LCV2 (swin-T) | All | 0.566 | 0.515 | 0.707 | 0.596 | 0.751 | 0.894 | 0.816
LCV2 (swin-B) | All | 0.566 | 0.516 | 0.659 | 0.578 | 0.763 | 0.856 | 0.807
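For readers unfamiliar with the box-level metrics reported above and in the following tables, the snippet below is a minimal sketch of how IoU and overlap-based precision/recall/F1 can be computed for a predicted box against a ground-truth box. One common region-based convention is assumed (precision = intersection over predicted area, recall = intersection over ground-truth area); the reported numbers follow the evaluation protocol of [7], which may differ in detail.

```python
# Minimal sketch of region-based grounding metrics for axis-aligned boxes
# given as (x1, y1, x2, y2). Assumed convention:
#   IoU = |pred ∩ gt| / |pred ∪ gt|
#   precision = |pred ∩ gt| / |pred|,  recall = |pred ∩ gt| / |gt|
# The exact protocol used in the tables follows [7] and may differ.

def _area(b):
    return max(0.0, b[2] - b[0]) * max(0.0, b[3] - b[1])

def _inter(a, b):
    w = min(a[2], b[2]) - max(a[0], b[0])
    h = min(a[3], b[3]) - max(a[1], b[1])
    return max(0.0, w) * max(0.0, h)

def grounding_metrics(pred, gt):
    inter = _inter(pred, gt)
    union = _area(pred) + _area(gt) - inter
    iou = inter / union if union else 0.0
    precision = inter / _area(pred) if _area(pred) else 0.0
    recall = inter / _area(gt) if _area(gt) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"iou": iou, "precision": precision, "recall": recall, "f1": f1}

# Example usage:
# grounding_metrics((10, 10, 60, 60), (20, 20, 70, 70))
```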
Models | T | Acc. | IoU Precision | IoU Recall | IoU F1 Score | Overlap Precision | Overlap Recall | Overlap F1 Score
---|---|---|---|---|---|---|---|---
MAC [58] | 4 | 0.977 | 0.140 | 0.335 | 0.197 | 0.249 | 0.563 | 0.346
MAC-Caps [7] | 4 | 0.968 | 0.240 | 0.391 | 0.297 | 0.470 | 0.731 | 0.572
MAC [58] | 6 | 0.980 | 0.126 | 0.236 | 0.164 | 0.301 | 0.524 | 0.382
MAC-Caps [7] | 6 | 0.980 | 0.290 | 0.476 | 0.361 | 0.485 | 0.798 | 0.603
MAC [58] | 12 | 0.985 | 0.085 | 0.181 | 0.116 | 0.287 | 0.533 | 0.373
MAC-Caps [7] | 12 | 0.979 | 0.277 | 0.498 | 0.356 | 0.509 | 0.946 | 0.662
SNMN [73] | 9 | 0.962 | 0.378 | 0.475 | 0.421 | 0.529 | 0.670 | 0.591
SNMN-Caps [7] | 9 | 0.967 | 0.506 | 0.518 | 0.512 | 0.738 | 0.781 | 0.759
LCV2 (swin-T) | - | 0.367 | 0.265 | 0.577 | 0.363 | 0.418 | 0.785 | 0.545
LCV2 (swin-B) | - | 0.367 | 0.296 | 0.425 | 0.349 | 0.492 | 0.660 | 0.564
Models | Obj. | Acc. | IoU Precision | IoU Recall | IoU F1 Score | Overlap Precision | Overlap Recall | Overlap F1 Score
---|---|---|---|---|---|---|---|---
LCV2 (Blip-L) | A | 0.566 | 0.323 | 0.590 | 0.417 | 0.497 | 0.805 | 0.614
LCV2 (lens) | A | 0.278 | 0.261 | 0.505 | 0.345 | 0.491 | 0.795 | 0.607
LCV2 (Git) | A | 0.518 | 0.292 | 0.545 | 0.380 | 0.463 | 0.770 | 0.579
LCV2 (Blip-L) | All | 0.566 | 0.516 | 0.659 | 0.578 | 0.763 | 0.856 | 0.807
LCV2 (lens) | All | 0.278 | 0.414 | 0.612 | 0.494 | 0.692 | 0.858 | 0.776
LCV2 (Git) | All | 0.518 | 0.506 | 0.649 | 0.568 | 0.756 | 0.852 | 0.801
Models | FLOPs | Params |
---|---|---|
BLIP-VQA-Capfilt-large | 59.063 G | 336.557 M |
Git-large-VQAv2 | 333.409 G | 369.705 M |
Lens | 3377.488 G | 1077.052 M |
Flan-T5-large | 13.075 G | 750.125 M |
Grounding DINO swin-T | 39.177 G | 144.140 M |
Grounding DINO swin-B | 58.812 G | 204.028 M |
LCV2 (Blip-L) | 130.950 G | 1290.710 M |
LCV2 (lens) | 3449.375 G | 2031.205 M |
LCV2 (Git) | 405.296 G | 1323.858 M |
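The complexity figures in the table above can be reproduced in outline with standard PyTorch tooling. The snippet below is a hedged sketch using fvcore's FLOP counter and a plain parameter count; the dummy input shape and counting settings are assumptions, as the exact measurement configuration is not stated in this section.

```python
# Minimal sketch of how parameter counts and FLOPs like those in the table
# above can be obtained for a PyTorch module. The 224x224 dummy input and the
# use of fvcore's counters are assumptions for illustration only.
import torch
from fvcore.nn import FlopCountAnalysis, parameter_count

def profile_module(model: torch.nn.Module, dummy_input: torch.Tensor):
    model.eval()
    with torch.no_grad():
        flops = FlopCountAnalysis(model, dummy_input).total()  # fvcore's operator-level FLOP count
    params = parameter_count(model)[""]                        # "" key holds the whole-model total
    return flops / 1e9, params / 1e6                           # report as GFLOPs and M params

# Example with a stand-in backbone (not one of the LCV2 modules):
# import torchvision
# gflops, mparams = profile_module(torchvision.models.resnet50(),
#                                  torch.randn(1, 3, 224, 224))
```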
Models | Acc. | IoU Precision | IoU Recall | IoU F1 Score | Overlap Precision | Overlap Recall | Overlap F1 Score
---|---|---|---|---|---|---|---
LCV2 (swin-T) | 0.367 | 0.265 | 0.577 | 0.363 | 0.418 | 0.785 | 0.545
LCV2 (swin-B) | 0.367 | 0.296 | 0.425 | 0.349 | 0.492 | 0.660 | 0.564
LCV2 (finetuned-Blip, swin-T) | 0.773 | 0.273 | 0.596 | 0.374 | 0.424 | 0.801 | 0.554
LCV2 (finetuned-Blip, swin-B) | 0.773 | 0.312 | 0.425 | 0.360 | 0.512 | 0.662 | 0.577
Year | Team | IoU
---|---|---
2022 | Aurora (ByteDance and Tianjin University) | 0.71
2022 | hsslab_inspur | 0.70
2022 | Pinkiepie | 0.33
2022 | binggo | 0.08
2023 | UD VIMS Lab (EAB) | 0.74
2023 | MGTV_Baseline | 0.72
2023 | DeepBlue_AI | 0.69
2023 | USTC | 0.46
2023 | ours | 0.43
Models | IoU |
---|---|
LCV2 (lens-swinB) | 0.424 |
LCV2 (Gittext-swinB) | 0.405 |
LCV2 (BlipL-swinB) | 0.425 |
LCV2 (BlipL-swinT) | 0.417 |
LCV2 (FineTunedBlipL-swinB) | 0.430 |