A Simple Framework for Scene Graph Reasoning with Semantic Understanding of Complex Sentence Structure
Abstract
1. Introduction
- We propose a simple yet effective scene graph reasoning framework that relies solely on multimodal transformers, without any additional module for encoding the scene graph. The proposed method yields substantial gains on the GQA task, which requires scene graph reasoning capability.
- We also propose a multi-task learning method that effectively handles queries with complex structures composed of multiple phrases. It treats the existing GQA answer classification as the main task and adds a sentence-pair classification problem as an auxiliary task that learns whether the grammatical structure of a question is valid (see the sketch after this list).
- We conduct extensive experiments to evaluate the effectiveness of the framework in terms of task performance, ablation studies, and generalization.
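The multi-task objective described in the second contribution can be viewed as a weighted sum of two cross-entropy losses computed on the transformer's pooled representation. The PyTorch sketch below is illustrative only and is not the authors' released code: the `MultiTaskHead` class, the hidden size, the answer-vocabulary size, and the auxiliary weight are placeholder assumptions.

```python
import torch
import torch.nn as nn

class MultiTaskHead(nn.Module):
    """Hypothetical two-head classifier on top of a multimodal transformer's
    pooled representation: one head predicts the GQA answer (main task), the
    other judges whether the paired question is the original or a shuffled,
    grammatically invalid variant (auxiliary task)."""
    def __init__(self, hidden_size: int, num_answers: int):
        super().__init__()
        self.answer_head = nn.Linear(hidden_size, num_answers)  # main task
        self.validity_head = nn.Linear(hidden_size, 2)          # auxiliary task

    def forward(self, pooled: torch.Tensor):
        return self.answer_head(pooled), self.validity_head(pooled)

def multitask_loss(answer_logits, validity_logits, answer_labels, validity_labels,
                   aux_weight: float = 0.5):
    """Weighted sum of the main and auxiliary losses; aux_weight is a
    placeholder, not a value reported in the paper."""
    ce = nn.CrossEntropyLoss()
    return ce(answer_logits, answer_labels) + aux_weight * ce(validity_logits, validity_labels)
```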
2. Related Works
3. Proposed Architecture
3.1. Input Representation
3.1.1. Shuffled Question
3.1.2. Triples
3.1.3. Image
3.2. Multimodal Transformers
3.3. Multi-Task Learning for Question Understanding
4. Experiment
4.1. Experimental Setup
4.1.1. Dataset
4.1.2. Baselines
- LXMERT [18]: a two-stream multimodal transformer pre-trained on large-scale image–text pair data to learn both intra-modality and cross-modality relationships (a usage sketch follows this list).
- OSCAR [19]: a one-stream multimodal transformer pre-trained on large-scale image–text pair data with multi-task learning designed to learn the alignment of image–text pairs.
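As one possible way to drive the two-stream baseline, the snippet below loads LXMERT through the Hugging Face transformers library and feeds it a question together with region features of the size reported in the hyperparameter table further below (36 objects, 2048-dimensional features, 4-dimensional boxes). The checkpoint name and the random placeholder features are assumptions; the paper does not state that this particular implementation was used.

```python
import torch
from transformers import LxmertTokenizer, LxmertModel

# Publicly available pre-trained checkpoint (an assumption; the paper's exact
# weights may differ).
tokenizer = LxmertTokenizer.from_pretrained("unc-nlp/lxmert-base-uncased")
model = LxmertModel.from_pretrained("unc-nlp/lxmert-base-uncased")

# Question text; serialized scene-graph triples could be appended here.
inputs = tokenizer("What color is the cup on the table?", return_tensors="pt")

# Region features in the shape LXMERT expects: 36 objects, 2048-d RoI features,
# 4-d normalized bounding boxes. Random tensors stand in for detector outputs.
visual_feats = torch.randn(1, 36, 2048)
visual_pos = torch.rand(1, 36, 4)

outputs = model(**inputs, visual_feats=visual_feats, visual_pos=visual_pos)
pooled = outputs.pooled_output  # cross-modal representation used for classification
```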
4.1.3. Implementation Details
4.2. Experimental Result
5. Discussion
6. Conclusions
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
References
- Turk, M. Multimodal interaction: A review. Pattern Recognit. Lett. 2014, 36, 189–195. [Google Scholar] [CrossRef]
- Matveev, Y.; Matveev, A.; Frolova, O.; Lyakso, E.; Ruban, N. Automatic Speech Emotion Recognition of Younger School Age Children. Mathematics 2022, 10, 2373. [Google Scholar] [CrossRef]
- Zgank, A. Influence of Highly Inflected Word Forms and Acoustic Background on the Robustness of Automatic Speech Recognition for Human–Computer Interaction. Mathematics 2022, 10, 711. [Google Scholar] [CrossRef]
- Mokady, R.; Hertz, A.; Bermano, A.H. ClipCap: CLIP Prefix for Image Captioning. arXiv 2021, arXiv:2111.09734. [Google Scholar]
- Wang, J.; Yang, Z.; Hu, X.; Li, L.; Lin, K.; Gan, Z.; Liu, Z.; Liu, C.; Wang, L. Git: A generative image-to-text transformer for vision and language. arXiv 2022, arXiv:2205.14100. [Google Scholar]
- Hu, X.; Gan, Z.; Wang, J.; Yang, Z.; Liu, Z.; Lu, Y.; Wang, L. Scaling up vision-language pre-training for image captioning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; pp. 17980–17989. [Google Scholar]
- Aafaq, N.; Akhtar, N.; Liu, W.; Gilani, S.Z.; Mian, A. Spatio-temporal dynamics and semantic attribute enriched visual encoding for video captioning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; pp. 12487–12496. [Google Scholar]
- Li, L.; Lei, J.; Gan, Z.; Yu, L.; Chen, Y.C.; Pillai, R.; Cheng, Y.; Zhou, L.; Wang, X.E.; Wang, W.Y.; et al. Value: A multi-task benchmark for video-and-language understanding evaluation. arXiv 2021, arXiv:2106.04632. [Google Scholar]
- Liu, S.; Ren, Z.; Yuan, J. Sibnet: Sibling convolutional encoder for video captioning. In Proceedings of the 26th ACM International Conference on Multimedia, Seoul, Republic of Korea, 22–26 October 2018; pp. 1425–1434. [Google Scholar]
- Alamri, H.; Cartillier, V.; Lopes, R.G.; Das, A.; Wang, J.; Essa, I.; Batra, D.; Parikh, D.; Cherian, A.; Marks, T.K.; et al. Audio Visual Scene-Aware Dialog (AVSD) Challenge at DSTC7. arXiv 2018, arXiv:1806.00525. [Google Scholar]
- He, L.; Liu, S.; An, R.; Zhuo, Y.; Tao, J. An End-to-End Framework Based on Vision-Language Fusion for Remote Sensing Cross-Modal Text-Image Retrieval. Mathematics 2023, 11, 2279. [Google Scholar] [CrossRef]
- Zhang, P.; Li, X.; Hu, X.; Yang, J.; Zhang, L.; Wang, L.; Choi, Y.; Gao, J. Vinvl: Revisiting visual representations in vision-language models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 20–25 June 2021; pp. 5579–5588. [Google Scholar]
- OpenAI. GPT-4 Technical Report. arXiv 2023, arXiv:2303.08774. [Google Scholar]
- Zhang, D.; Ren, A.; Liang, J.; Liu, Q.; Wang, H.; Ma, Y. Improving Medical X-ray Report Generation by Using Knowledge Graph. Appl. Sci. 2022, 12, 11111. [Google Scholar] [CrossRef]
- Ramesh, V.; Chi, N.A.; Rajpurkar, P. Improving Radiology Report Generation Systems by Removing Hallucinated References to Non-existent Priors. arXiv 2022, arXiv:2210.06340. [Google Scholar]
- Sharma, D.; Purushotham, S.; Reddy, C.K. MedFuseNet: An attention-based multimodal deep learning model for visual question answering in the medical domain. Sci. Rep. 2021, 11, 19826. [Google Scholar] [CrossRef] [PubMed]
- Goyal, Y.; Khot, T.; Summers-Stay, D.; Batra, D.; Parikh, D. Making the v in vqa matter: Elevating the role of image understanding in visual question answering. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 6904–6913. [Google Scholar]
- Tan, H.; Bansal, M. LXMERT: Learning Cross-Modality Encoder Representations from Transformers. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Hong Kong, China, 3–7 November 2019; pp. 5100–5111. [Google Scholar] [CrossRef]
- Li, X.; Yin, X.; Li, C.; Zhang, P.; Hu, X.; Zhang, L.; Wang, L.; Hu, H.; Dong, L.; Wei, F.; et al. Oscar: Object-semantics aligned pre-training for vision-language tasks. In Proceedings of the European Conference on Computer Vision (ECCV), Glasgow, UK, 23–28 August 2020; pp. 121–137. [Google Scholar]
- Lu, J.; Batra, D.; Parikh, D.; Lee, S. Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. In Proceedings of the 33rd International Conference on Neural Information Processing Systems, Vancouver, BC, Canada, 8–14 December 2019; Volume 32. [Google Scholar]
- Li, L.H.; Yatskar, M.; Yin, D.; Hsieh, C.J.; Chang, K.W. Visualbert: A simple and performant baseline for vision and language. arXiv 2019, arXiv:1908.03557. [Google Scholar]
- Socher, R.; Ganjoo, M.; Manning, C.D.; Ng, A. Zero-shot learning through cross-modal transfer. arXiv 2013, arXiv:1301.3666. [Google Scholar]
- Zhou, L.; Palangi, H.; Zhang, L.; Hu, H.; Corso, J.; Gao, J. Unified vision-language pre-training for image captioning and vqa. In Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA, 7–12 February 2020; Volume 34, pp. 13041–13049. [Google Scholar]
- Chen, Y.C.; Li, L.; Yu, L.; Kholy, A.E.; Ahmed, F.; Gan, Z.; Cheng, Y.; Liu, J. UNITER: UNiversal Image-TExt Representation Learning. arXiv 2020, arXiv:1909.11740. [Google Scholar]
- Li, G.; Duan, N.; Fang, Y.; Gong, M.; Jiang, D. Unicoder-vl: A universal encoder for vision and language by cross-modal pre-training. In Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA, 7–12 February 2020; Volume 34, pp. 11336–11344. [Google Scholar]
- Hudson, D.A.; Manning, C.D. Gqa: A new dataset for real-world visual reasoning and compositional question answering. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; pp. 6700–6709. [Google Scholar]
- Liang, W.; Jiang, Y.; Liu, Z. GraphVQA: Language-Guided Graph Neural Networks for Graph-based Visual Question Answering. In Proceedings of the Third Workshop on Multimodal Artificial Intelligence, Mexico City, Mexico, 1–5 June 2021; pp. 79–86. [Google Scholar] [CrossRef]
- Hudson, D.A.; Manning, C.D. Compositional Attention Networks for Machine Reasoning. arXiv 2018, arXiv:1803.03067. [Google Scholar]
- Kim, E.S.; Kang, W.Y.; On, K.W.; Heo, Y.J.; Zhang, B.T. Hypergraph Attention Networks for Multimodal Learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020. [Google Scholar]
- Yao, T.; Pan, Y.; Li, Y.; Mei, T. Exploring Visual Relationship for Image Captioning. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 684–699. [Google Scholar]
- Yang, X.; Tang, K.; Zhang, H.; Cai, J. Auto-Encoding Scene Graphs for Image Captioning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; pp. 10685–10694. [Google Scholar]
- Girshick, R. Fast R-CNN. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Santiago, Chile, 7–13 December 2015; pp. 1440–1448. [Google Scholar]
- Shi, Z. Image Semantic Analysis and Understanding. In Proceedings of the International Conference on Intelligent Information Processing, Manchester, UK, 13–16 October 2010; Shi, Z., Vadera, S., Aamodt, A., Leake, D., Eds.; Springer: Berlin/Heidelberg, Germany, 2010; pp. 4–5. [Google Scholar]
- Sun, G.; Wang, W.; Dai, J.; Gool, L.V. Mining Cross-Image Semantics for Weakly Supervised Semantic Segmentation. arXiv 2020, arXiv:2007.01947. [Google Scholar]
- Krishna, R.; Zhu, Y.; Groth, O.; Johnson, J.; Hata, K.; Kravitz, J.; Chen, S.; Kalantidis, Y.; Li, L.J.; Shamma, D.A.; et al. Visual Genome: Connecting Language and Vision Using Crowdsourced Dense Image Annotations. arXiv 2016, arXiv:1602.07332. [Google Scholar] [CrossRef]
- Pham, K.; Kafle, K.; Lin, Z.; Ding, Z.; Cohen, S.; Tran, Q.; Shrivastava, A. Learning to Predict Visual Attributes in the Wild. arXiv 2021, arXiv:2106.09707. [Google Scholar]
- Kamath, A.; Singh, M.; LeCun, Y.; Synnaeve, G.; Misra, I.; Carion, N. MDETR—Modulated Detection for End-to-End Multi-Modal Understanding. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, BC, Canada, 11–17 October 2021; pp. 1780–1790. [Google Scholar]
- Gardner, M.; Grus, J.; Neumann, M.; Tafjord, O.; Dasigi, P.; Liu, N.F.; Peters, M.; Schmitz, M.; Zettlemoyer, L.S. AllenNLP: A Deep Semantic Natural Language Processing Platform. arXiv 2017, arXiv:1803.07640. [Google Scholar]
- Caruana, R. Multitask learning. Mach. Learn. 1997, 28, 41–75. [Google Scholar] [CrossRef]
- Zhang, Y.; Yang, Q. A Survey on Multi-Task Learning. IEEE Trans. Knowl. Data Eng. 2022, 34, 5586–5609. [Google Scholar] [CrossRef]
- Zhang, Z.; Yu, W.; Yu, M.; Guo, Z.; Jiang, M. A Survey of Multi-task Learning in Natural Language Processing: Regarding Task Relatedness and Training Methods. In Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics, Dubrovnik, Croatia, 2–6 May 2023; pp. 943–956. [Google Scholar]
- Clark, K.; Luong, M.T.; Le, Q.V.; Manning, C.D. ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators. arXiv 2020, arXiv:2003.10555. [Google Scholar]
- Suhr, A.; Zhou, S.; Zhang, A.; Zhang, I.; Bai, H.; Artzi, Y. A Corpus for Reasoning about Natural Language Grounded in Photographs. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, 28 July–2 August 2019; pp. 6418–6428. [Google Scholar] [CrossRef]
Hyperparameter | LXMERT | OSCAR |
---|---|---|
batch size | 64 | 48 |
epochs | 10 | 10 |
learning rate | | |
gradient accumulation steps | 2 | 1 |
optimizer | AdamW | AdamW |
scheduler | cosine annealing | cosine annealing |
max sequence length | 300 | 300 |
number of object features | 36 | 36 |
dimension of object features | 2048 | 2048 |
dimension of boundary features | 4 | 4 |
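A minimal fine-tuning loop matching the LXMERT column of this table (AdamW, cosine annealing, 10 epochs, gradient accumulation of 2) might look as follows; the `finetune` function, its `train_loader` argument, and the default learning rate are assumptions, since the learning-rate values did not survive extraction.

```python
import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import CosineAnnealingLR

def finetune(model: torch.nn.Module, train_loader, epochs: int = 10,
             lr: float = 1e-5, grad_accum_steps: int = 2):
    """Sketch of a fine-tuning loop with the schedule from the table above.
    The learning rate is a placeholder, not a value reported in the paper."""
    optimizer = AdamW(model.parameters(), lr=lr)
    scheduler = CosineAnnealingLR(optimizer, T_max=epochs)
    for _ in range(epochs):
        optimizer.zero_grad()
        for step, batch in enumerate(train_loader):
            loss = model(**batch).loss / grad_accum_steps  # scale for accumulation
            loss.backward()
            if (step + 1) % grad_accum_steps == 0:
                optimizer.step()
                optimizer.zero_grad()
        scheduler.step()
```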
Model (Two Stream) | Accuracy (%) |
---|---|
LXMERT w/o SG | 58.90 |
LXMERT + SG | 79.79 |
LXMERT + SG + TS | 80.40 |
LXMERT + SG + TS + MTL | 81.47 |
Model (One Stream) | Accuracy (%) |
---|---|
OSCAR w/o SG | 60.80 |
OSCAR + SG | 80.81 |
OSCAR + SG + TS | 81.21 |
OSCAR + SG + TS + MTL | 81.91 |
Model (Two Stream) | Accuracy (%) |
---|---|
LXMERT w/o Scene Graph | 58.90 |
BERT + Scene Graph | 69.10 |
LXMERT + Scene Graph | 80.40 |
Model (LXMERT) | VQA | NLVR |
---|---|---|
LXMERT w/o SG | 70.40 | 72.8 |
LXMERT + SG + MTL | 71.58 (+1.18) | 73.39 (+0.59) |
Model (OSCAR) | VQA | NLVR |
---|---|---|
OSCAR w/o SG | 72.67 | 71.8 |
OSCAR + SG + MTL | 73.30 (+0.63) | 71.68 (+0.45) |