1. Introduction
The rapid expansion of multimedia environments in recent years has led to an explosive increase in demand for multimodal systems that can communicate with humans in various ways [1]. Speech recognition systems [2,3], commonly found in mobile phones and navigation systems, are the easiest way to experience human-computer interaction these days; they are artificial intelligence technologies for a single modality, dealing only with the human voice. Multimodal technology, in contrast, tackles the more difficult problem of handling multiple modalities such as language, vision, speech, and motion [4,5,6,7,8,9,10]. Among all the combinations, the fastest growing topic is the convergence of visual and language intelligence [11,12]. One typical example is multimodal large language models [13], which combine existing text-based comprehension capabilities with image processing capabilities. They can create paragraph-length, detailed descriptions of images and converse with human beings about an image. Another example can be found in the medical domain: doctors can leverage large-language-model-based architectures to automatically generate reports from radiographs [14,15] or utilize system-generated interpretations of radiographs in the form of text [16]. For this to succeed, two conditions must be satisfied. First, the system must be able to accurately understand visual information. Second, the system must be able to make inferences based on language-oriented knowledge and visual information.
Visual Question Answering (VQA) is a typical artificial intelligence research topic that can verify whether a system possesses multimodality satisfying these two conditions. More specifically, it is the task of determining answers to natural language questions related to images, taking both images and questions as inputs [17]. Existing VQA models require two main components: obtaining a structured representation of the image and processing natural language questions related to that structured representation [18,19,20,21,22,23,24,25]. These models select answers based on either the image or an image segmentation as input, but it is unknown whether they correctly capture the relationships between the objects represented in the image.
In order to test whether such relationships are well understood, we mainly focus on the Graph-structured visual Question Answering (GQA) task [26]. Compared to the traditional VQA task, it has two major differences. One is a scene graph expressing the semantics of an image in natural language, and the other is a question with a complex structure composed of several noun phrases. As shown in Figure 1, a scene graph is a graph data structure that expresses the information in an image in the form of natural language and includes three elements: objects in the image (e.g., tray, girl, salad), attributes of objects (e.g., the table is wooden, the napkin is white), and relations between two adjacent objects (e.g., to the right of, on, wearing). In other words, the scene graph can be regarded as a set of triples, each consisting of two adjacent objects and their relation, as described in Figure 1. In this paper, we regard all attributes as objects. The scene graph serves as a powerful knowledge resource for the system to understand images through linguistic intelligence, providing a new driving force to enhance the semantic understanding of images.
Next, since a question is usually composed of several noun phrases, the ability to understand the question is very important in evaluating the ability to understand visual information. This structure is more complex than that of the questions in the typical VQA task. The question in Figure 1 asks about the color of the tray in the image. If the system precisely understands the question and predicts the correct answer, we can indirectly evaluate how well the system perceives the tray among the other objects in the image.
Existing approaches [18,19,20,21,22,23,24,25] have taken as their backbones a family of multimodal transformers that have shown tremendous progress in visual language domains. Their only difference from the transformer-based language models that have shown remarkable performance in natural language processing is that they additionally use visual information as input. They are largely divided into two types according to how the visual information is fed to the model. One is the one-stream framework, in which image and text features are integrated through the same self-attention layers at the same time. In this paper, OSCAR [19] is used as a representative model. On the other hand, two-stream frameworks perform multimodal fusion in stages according to modality. First, image features and text features are turned into contextualized representations through separate self-attention layers. Then, each representation is fed into one unified self-attention layer, resulting in multimodal fusion by attending to all the knowledge in the layer. In this paper, LXMERT [18] is used as a representative model.
Even though multimodal transformers have shown comparable performance in understanding visual linguistic information, an additional structural transformation has been required in order to process visual linguistic information accompanied by a scene graph [27,28,29,30,31]. Specifically, Yang et al. [31] introduced a graph-convolutional-network-based architecture in which all the graphs are transformed into latent variables and fed directly into the model. Despite great advances in the downstream task, there is still a caveat. If there are not enough data to learn the graph structure, a cold-start problem arises in which graphs from domains not seen during training cannot be processed at all. This is a problem that can occur in a real environment where no human-labeled scene graph is available. Therefore, a more effective scene graph processing method capable of mitigating this data dependence is needed.
To this end, we propose a simple yet effective scene graph reasoning framework using multimodal transformers with multi-task learning to better capture the semantics of questions. As shown in Figure 2, we adopt pre-trained multimodal transformer frameworks that were trained on a large-scale visual linguistic dataset. Unlike typical multimodal transformers, the proposed framework utilizes a scene graph as an additional input. Unlike the methods of previous studies [27,28,29,30,31], which require a separate scene graph encoding module, the proposed framework can effectively handle scene graphs simply by linearizing them and relying on the prior knowledge of the pre-trained language model. A scene graph can then be treated like natural language input, just like questions. It is a set of triples defined by two related objects in an image and their relational information. A triple (e.g., girl to the right of tray) can be considered a sentence even though its grammatical structure is incomplete. Therefore, a set of triples can be considered a set of sentences, allowing the proposed framework, with its strong language understanding ability, to understand the semantics of images from text. In addition, we propose a multi-task learning method that uses evaluating the grammatical validity of questions as an auxiliary task to better understand questions with complex structures. It uses the semantic role labels of the question to randomly shuffle the sentence structure of the question; the framework then addresses a binary classification problem that predicts the adequacy of the sentence structure of the question. We have conducted extensive experiments to evaluate the effectiveness of the proposed framework in terms of task capabilities, ablation studies, and generalization.
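To make the linearization step concrete, a minimal sketch is given below. The triples, the separator token, and the helper name are illustrative assumptions rather than the exact implementation.

# Minimal sketch of scene graph linearization (illustrative only; the triples,
# separator, and helper name are assumptions, not the exact implementation).
scene_graph = [
    ("girl", "to the right of", "tray"),
    ("napkin", "is", "white"),
    ("salad", "on", "table"),
]

def linearize(triples):
    # Each triple becomes a short pseudo-sentence; triples are joined with periods
    # so the whole graph can be consumed as one ordinary text sequence.
    return " . ".join(f"{s} {r} {o}" for s, r, o in triples)

question = "What color is the tray?"
# The question and the linearized scene graph are concatenated into one text input,
# so no graph-specific encoder is needed.
model_input_text = question + " [SEP] " + linearize(scene_graph)
print(model_input_text)

In this way, the scene graph enters the model through the same text channel as the question, which is the key reason no additional graph encoder is required.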
The contributions of this paper can be summarized as follows:
We propose a simple yet effective scene graph reasoning framework that uses only multimodal transformer structures, without any additional module for understanding the scene graph. The proposed method yields significant gains on GQA tasks that require scene graph reasoning capability.
We also propose a multi-task learning method that can effectively understand questions with complex structures composed of multiple phrases. The multi-task learning takes the classification problem of the existing GQA task as the main task and uses a sentence pair classification problem, which learns the validity of the grammatical structure of sentences, as an auxiliary task (a minimal sketch of the auxiliary example construction is given after this list).
We have conducted extensive experiments to evaluate the effectiveness in terms of task capabilities, ablation studies, and generalization.
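The following minimal sketch illustrates how grammatically invalid question variants can be generated for the auxiliary task. The phrase segmentation shown here is a stand-in for the semantic-role-based segmentation described above, and the helper name is hypothetical.

import random

# Illustrative sketch of the auxiliary task (assumed helper; in practice the
# phrase boundaries come from semantic role labels, not a manual split).
def make_auxiliary_examples(phrases, rng=random):
    # Assumes at least two distinct phrases so shuffling can change the order.
    original = " ".join(phrases)
    shuffled = phrases[:]
    while shuffled == phrases:        # ensure the order actually changes
        rng.shuffle(shuffled)
    # Label 1 = grammatically valid (original order), 0 = shuffled/invalid.
    return [(original, 1), (" ".join(shuffled), 0)]

phrases = ["what color is", "the tray", "to the left of", "the girl"]
for text, label in make_auxiliary_examples(phrases):
    print(label, "|", text)

The resulting binary labels provide the auxiliary supervision signal that is trained jointly with the main GQA answer classification.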
The remainder of this paper is organized as follows. Section 2 introduces related work. The proposed architecture is described in Section 3. Section 4 and Section 5 present the experiments and discussion, respectively. Finally, Section 6 concludes the paper.
2. Related Works
In the past decade, there has been a line of research on pre-training large-scale vision language models, which have shown remarkable performance on vision language tasks [18,19,20,21,22,23,24,25]. These models leverage transformer architectures, called multimodal transformers, to learn cross-modal representations from pairs of images and text. More specifically, there are basically two steps to fusing the different modalities. In order to encode visual knowledge into the same semantic space as language, a pre-trained object detection model (e.g., Faster-RCNN [32]) is exploited to extract object-level visual features. Each object feature can be regarded as the embedding of a token in the text. The next step is to fuse all the knowledge in the same self-attention layers to blend textual and visual features. Based on the network architecture, multimodal transformers can be divided into two types. One type [12,19], called the one-stream multimodal transformer, contains only a single stack of layers for multimodal fusion, linking all the visual and text features into a single sequence. In this paper, we adopt OSCAR [19] as a representative model for the one-stream transformer. The other type [18,20], called the two-stream multimodal transformer, consists of two fusion stages. The first stage has a separate self-attention stack for each modality, so the features of each modality are fused internally; in other words, the contextual knowledge within each modality is processed at this stage. The second stage then takes as input the contextualized representations from both modalities and attends to all the knowledge regardless of modality. In this paper, we adopt LXMERT [18] as a representative model for the two-stream transformer.
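The structural difference between the two types can be summarized in the following minimal sketch. It is a simplified illustration using generic self-attention encoders (the actual LXMERT cross-modal encoder also uses cross-attention between modalities), and all dimensions and layer counts are placeholders.

import torch
import torch.nn as nn

# Placeholder sizes: 36 detected regions, 20 question tokens, hidden size 768.
NUM_OBJECTS, NUM_TOKENS, HIDDEN = 36, 20, 768
object_feats = torch.randn(1, NUM_OBJECTS, HIDDEN)  # projected region features (e.g., from Faster-RCNN)
text_embeds = torch.randn(1, NUM_TOKENS, HIDDEN)    # question token embeddings

layer = nn.TransformerEncoderLayer(d_model=HIDDEN, nhead=12, batch_first=True)

# One-stream (OSCAR-style): both modalities are concatenated into one sequence
# and fused by a single self-attention stack.
one_stream = nn.TransformerEncoder(layer, num_layers=2)
fused_one = one_stream(torch.cat([text_embeds, object_feats], dim=1))

# Two-stream (LXMERT-style): each modality is contextualized separately first,
# and the resulting representations are fused afterwards.
text_encoder = nn.TransformerEncoder(layer, num_layers=2)
vision_encoder = nn.TransformerEncoder(layer, num_layers=2)
cross_encoder = nn.TransformerEncoder(layer, num_layers=2)
fused_two = cross_encoder(
    torch.cat([text_encoder(text_embeds), vision_encoder(object_feats)], dim=1)
)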
While the quantitative performance of transformer-based visual language understanding models continues to improve, their behavior remains largely a black box qualitatively. Research on image semantics was conducted even before deep learning became popular. Image semantic analysis [33,34] means not only detecting objects in an image but also identifying the attributes of objects and the relationships between them. In particular, an artificial intelligence system capable of analyzing the spatial relationships between objects (e.g., up, down, around, inside) can be used for various applications such as image search and image-text mapping. In order for artificial intelligence systems to recognize visual relationships, data in which the relationships between objects are labeled are essential. Visual Genome [35] is a large-scale dataset built as a graph structure by manually labeling, in natural language, the relationships and attributes of objects in static images. Accordingly, graph convolution networks [27,36], which are specialized for processing graph data, are used as the representative encoder architecture. Despite significant progress, the premise that a separate encoder is needed for scene graph understanding is an issue to be improved. Additionally, this approach is not useful in real environments where human-constructed scene graphs are very rare.
Therefore, in the following section, we introduce a simple yet effective scene graph reasoning framework for visual question answering that uses pre-trained multimodal transformers with multi-task learning to better capture the semantics of questions. Our work has two distinct differences from previous studies [21,37]. First of all, our work focuses on developing a multimodal transformer specialized for visual question answering tasks. More specifically, we propose a multi-task learning method in which pre-trained multimodal transformers (e.g., LXMERT, OSCAR) are fine-tuned with GQA data. In contrast, the aim of the two existing studies [21,37] is to develop a pre-trained multimodal transformer with an excellent multimodal (vision language) understanding ability. That is, they deal with new methods for developing pre-trained multimodal transformers such as LXMERT or OSCAR, which we use as baselines in this work.
Secondly, and more fundamentally, our goal is to develop a visual question answering method that reasons over the scene graph, which represents the semantics of an image in the form of natural language, within the multimodal transformer framework. Together with the first point, the pre-trained multimodal transformers are fine-tuned to predict accurate answers from triples composed of an image, its scene graph, and its question.
5. Discussion
In this section, we investigate our proposed framework in terms of an ablation study and generalization for model interpretation. The ablation experiments were conducted to validate the effect of scene graphs on visual language understanding. The first result (LXMERT + Scene Graph) in Table 4 represents the accuracy of the proposed model without multi-task learning. The second result (LXMERT w/o Scene Graph) involves fine-tuning LXMERT using only GQA images and their corresponding question-answer pairs, without scene graphs; in this case, scene graphs were not used during either training or evaluation. The two models exhibited a significant difference in accuracy, with a margin of 21.5 points. This indicates that scene graphs contain sufficient semantic information about the images. More importantly, it suggests that the model's understanding of images is improved by reasoning over scene graphs expressed in natural language.
In addition, to strengthen the claim that scene graph reasoning plays a helpful role in multimodal understanding, another experiment was conducted to assess the accuracy of answer prediction when only the scene graph of an image and the corresponding question are provided, without the actual image. Since BERT is the text understanding model embedded in LXMERT, this experiment leverages BERT to train on questions and scene graphs. The result (BERT + scene graph) in Table 4 shows an accuracy 10.2 points higher than the second result, which was trained only with images and questions and without a scene graph. This indicates that even without direct image information, utilizing natural language descriptions that capture the structural characteristics of an image allows a certain level of visual understanding. Therefore, high-quality scene graphs can help the model comprehend images, which in this setting have lower expressive power than text.
A further experiment was designed to verify the generalization capability of the proposed model on other visual language tasks such as Visual Question Answering (VQA) [17] and Natural Language for Visual Reasoning (NLVR) [43]. The proposed method demonstrates effectiveness across similar visual language understanding datasets, regardless of the type of multimodal transformer. As shown in Table 5, the proposed LXMERT-based approach achieves 1.18 points higher accuracy than the baseline on VQA and a 0.59-point improvement on NLVR. Likewise, the proposed OSCAR-based method, as shown in Table 6, achieves 0.63 points higher accuracy than the baseline on VQA and a 0.45-point improvement on NLVR. The experimental results generally indicate performance improvements, although their magnitude is not substantial. This can be attributed to the absence of scene graphs in the evaluation data, where only images and text are provided; this creates an evaluation environment significantly different from the training setting and may limit the performance improvement.