Article

Multimodal Natural Language Explanation Generation for Visual Question Answering Based on Multiple Reference Data

1 Graduate School of Information Science and Technology, Hokkaido University, N-14, W-9, Kita-ku, Sapporo 060-0814, Japan
2 Faculty of Information Science and Technology, Hokkaido University, N-14, W-9, Kita-ku, Sapporo 060-0814, Japan
* Author to whom correspondence should be addressed.
Electronics 2023, 12(10), 2183; https://doi.org/10.3390/electronics12102183
Submission received: 25 March 2023 / Revised: 20 April 2023 / Accepted: 9 May 2023 / Published: 10 May 2023

Abstract:
As deep learning research continues to advance, interpretability is becoming as important as model performance. Conducting interpretability studies to understand the decision-making processes of deep learning models can improve performance and provide valuable insights for humans. The interpretability of visual question answering (VQA), a crucial task for human–computer interaction, has garnered the attention of researchers due to its wide range of applications. The generation of natural language explanations for VQA that humans can better understand has gradually supplanted heatmap representations as the mainstream focus in the field. Humans typically answer questions by first identifying the primary objects in an image and then referring to various information sources, both within and beyond the image, including prior knowledge. However, previous studies have only considered input images, resulting in insufficient information that can lead to incorrect answers and implausible explanations. To address this issue, we introduce multiple references in addition to the input image. Specifically, we propose a multimodal model that generates natural language explanations for VQA. We introduce outside knowledge using the input image and question and incorporate object information into the model through an object detection module. By increasing the information available during the model generation process, we significantly improve VQA accuracy and the reliability of the generated explanations. Moreover, we employ a simple and effective feature fusion joint vector to combine information from multiple modalities while maximizing information preservation. Qualitative and quantitative evaluation experiments demonstrate that the proposed method can generate more reliable explanations than state-of-the-art methods while maintaining answering accuracy.

1. Introduction

Deep learning has advanced rapidly in recent years, resulting in the practical application of artificial intelligence systems in various scenarios and having a significant impact on society [1]. As a multi-step reasoning task that involves computer vision (CV) [2] and natural language processing (NLP) [3], visual question answering (VQA) [4] has become one of the most important areas of development, with applications such as detecting defects in flow lines [5] and supporting complementary diagnostics [6,7]. Because these applications are directly related to the safety of people’s lives and property, new regulations have been introduced that require model reliability, increasing the demand for model interpretability [8,9]. Current VQA methods primarily focus on performance improvement, but they remain black boxes that lack transparency in their question-answering processes; as a result, they may not meet users’ reliability requirements. Furthermore, if a model generates an incorrect answer and its prediction logic is not understood, there is no explicit way to improve its predictions.
Providing a reasonable explanation of the question-answering process has been proposed as a solution to enhance the reliability of VQA models [10]. Traditional explanation generation models use heat maps to indicate the influence of different regions of the image on the decision-making process [11,12]. These approaches use mathematical principles to visualize and analyze the underlying logic of neural networks, which is essential for advancing deep learning research. However, for most non-expert users of AI systems, such explanations are abstract or even incomprehensible. Thus, a method that helps these users easily judge the reliability of decisions made by deep learning models is necessary. To address this problem, Park et al. [13] and Wu et al. [14] proposed natural language explanation generation models that produce explanations that are both human-readable and fine-grained. These approaches use independent models to generate explanations for the VQA model. However, because the explanation generation model is independent, it can be challenging to ensure that the generated explanation is consistent with the reasoning used by the VQA model. Recently, pre-trained transformer-based large-scale models, such as the generative pre-trained transformer (GPT)-2 [15] and contrastive language-image pre-training (CLIP) [16], have developed rapidly. Trained on large amounts of data, these models have improved both image understanding and the accuracy of generated sentences. Marasović et al. [17], Kayser et al. [18], and Sammani et al. [19] addressed the limitations of previous models by proposing intrinsically interpretable models that integrate the generation of explanations and answers into a single model, ensuring that generated explanations are consistent with the logic of the question-answering process. Although these models ensure the consistency of the underlying logic, many of the generated explanations are still implausible.
As depicted in Figure 1, explanations provided by humans are typically more diverse and include multiple aspects compared to those generated by the original natural language explanation model, which relies solely on image information. The reason for this disparity is that answering a question requires knowledge beyond the scope of perception and recognition.
Large-scale models are pre-trained on a large and diverse range of datasets to acquire a wide array of knowledge. However, the current approach of fine-tuning large-scale models on downstream tasks leads to catastrophic forgetting [20], which causes much of the knowledge learned during pre-training to be lost. Therefore, relying only on the information in the pre-trained models is not sufficient, and it is necessary to introduce additional reference information.
Figure 1. Limitations of the original explanation model. Although the previous natural language explanation generation model can provide an explanation, it lags behind humans in terms of details and rationality. We used our previous work [21] for the original explanation generation model in the figure.
In our previous works [21,22], we introduced image captions and caption-based outside knowledge as novel modalities to improve model performance. However, the generated caption in Figure 1 highlights the limitations of these works. Some generated captions have little relevance to the question, making them inappropriate as references for answering. Furthermore, when the caption is irrelevant, the outside knowledge retrieved with it also becomes unusable as a reference.
Humans tend to answer questions holistically, drawing on a range of information sources and considering multiple aspects of the image. For example, when presented with an image, humans typically begin by identifying the objects in the image and then draw on their background knowledge to provide relevant cues when answering questions. In a common VQA dataset, a large proportion of questions either pertain to objects or are influenced by objects within the image. Therefore, properly recognizing objects can effectively improve the model’s understanding of images and ability to answer questions. By emulating this human approach, we incorporate multiple sources of reference information to improve the model’s performance. An effective approach for incorporating reference information is to use object recognition models [23]. These models can accurately identify the categories and quantities of objects in an image, which can then be used as reference information during the explanation generation process. In addition, outside knowledge that contains information beyond the image is typically considered when answering questions [24]. Therefore, we adopted an approach that combines both image and question information to directly retrieve relevant outside knowledge.
This study proposes a multimodal natural language explanation generation model based on multiple reference data. The multiple references contain three kinds of textual data: multiple image captions, outside knowledge, and object information. Specifically, we used different image-captioning models to generate multiple captions and an object detection model to obtain information about the objects in the image. We retrieved the outside knowledge most relevant to the question and the image from Wikipedia [25]. By using input images and questions in the retrieval process, rather than relying on potentially irrelevant captions, our approach mitigates the issue of obtaining inadequate outside knowledge. Moreover, we used two different multilayer perceptrons (MLPs) with many hidden layers to process vision and language features separately and a joint vector containing all information for the generation process. By using the novel joint vector to generate results, our model can improve the rationality of generated explanations and the accuracy of generated answers. Finally, we summarize the contributions of this study as follows.
  • We introduce multiple reference sources into the explanation generation model to solve the informativeness problem and analyze the impact of different reference information on model performance.
  • We use images and questions for retrieval to solve the problem of outside knowledge failure caused by caption-based retrieval methods in the previous work [26].
  • We use a simple and efficient feature fusion method to address the problem of fusing multiplied multimodal reference information features.
  • The experimental results show that the generated answers are more correct, and the generated explanations are more comprehensive and reasonable than answers and explanations generated by other state-of-the-art models.
Note that this study is an extended version of our previous work [26]. Compared with our previous work, we modified the outside knowledge retrieval approach to solve the problem of outside knowledge failure. We introduce object information as a supplement to reference information to improve model performance. We also improve the feature fusion approach to make it simpler and more efficient. In contrast to previous works where only reference information was added, this study investigates how the individual reference information affects the model’s performance through ablation experiments. As a result, this study serves as an extension and summary of previous works, further improving the quantity and quality of reference information while synthesizing and refining the theory of introducing reference information. This study also improves the experimental results, which showed that in language evaluation metrics, our model was 3.3% higher than the state-of-the-art model [19] and 1.8% higher than the previous work [26].
We structured the paper as follows: Section 2 provides an overview of related work, and Section 3 introduces our proposed multimodal natural language explanation generation model for VQA, which incorporates multiple reference sources. Section 4 presents both the qualitative and quantitative experimental results, and Section 5 provides a detailed discussion. Finally, we conclude with a summary of our work in Section 6.

2. Related Works

2.1. Visual Question Answering

The VQA task involves answering questions about an image, requiring an understanding of the image content and textual information. It is, therefore, an interdisciplinary task that integrates both CV and NLP. Depending on whether outside knowledge is required for reasoning, the VQA task can be divided into two categories. The VQA tasks that do not depend on outside knowledge refer to tasks such as fine-grained recognition (e.g., “What beverage is in the cup?”), scene recognition (e.g., “Is it raining?”), and activity recognition (e.g., “What sport is being played?”). The VQA tasks that depend on outside knowledge include knowledge-based reasoning (e.g., “Is this food healthy?”) and common sense reasoning (e.g., “Is this a professional game?”).
As a challenging task, VQA was first introduced in the study by S. Antol et al. [4] and has been improved by many researchers to address different limitations. Ben-younes et al. [27] and Jiang et al. [28] improved the efficiency of the VQA model, and Li et al. [29] and Tang et al. [30] used the attention mechanism and adversarial learning to improve the prediction accuracy of the model. Marino et al. [24] recognized that when humans answer questions about images, they consider both the information contained in the image and background knowledge outside the image. To address this, they proposed outside knowledge VQA, which uses outside knowledge to answer visual questions. Their findings on the importance of outside knowledge highlight the significance of referring to multiple sources of information.

2.2. Natural Language Explanation Generation for Visual Question Answering

The interpretability of deep learning models has been a significant area of focus for researchers. Visual interpretation of decisions has been the subject of a diverse range of research efforts. Several approaches find differentiated visual patches [11,12], whereas others aim to understand intermediate features necessary for the final decision [31,32,33], such as what a particular neuron represents. These methods use heat maps to visualize the influences of different regions in the image on the decisions made by deep learning models. However, heat maps may be inefficient for specialized images, such as radiology images, due to their complexity and the need for specialized expertise to interpret them.
Researchers have proposed models for generating human-understandable explanations for VQA in natural language, replacing abstract heat maps [13,14,17]. This allows individuals to understand the explanations of the model without the need for specialized background knowledge. However, these models use an explanation generation model that is independent of the question-answering model, which presents a challenge in verifying that the response model uses the logic expressed by the explanation. To address the limitation that affects the reliability of explanations generated by the model, Kayser et al. [18] and Sammani et al. [19] used a single model to generate both answers and explanations, ensuring that the generated explanations accurately reflect the logic used by the models to answer the questions. These methods only refer to the input image, and the model performance is limited by the lack of reference information. Therefore, our previous works [21,22] introduced image captions and outside knowledge as additional modalities to improve model performance. However, there are cases where the generated caption is invalid, which also invalidates the outside knowledge by association.

2.3. Large-Scale Language Models

Large-scale language models (LLMs) have been at the forefront of NLP research in recent years [34]. Several NLP tasks, such as language understanding [35], generation [36], and translation [37], have achieved remarkable performance through the use of these models, resulting in state-of-the-art outcomes [35,36]. The key advantage of LLMs is their ability to capture complex patterns and relationships in language, made possible by their training on large amounts of text data. Thus, LLMs have become essential tools for NLP research and applications.
In 2018, Radford et al. [38] introduced the original GPT as a prominent LLM. GPT uses a transformer-based architecture and a pretraining strategy called language modeling to generate high-quality texts. In 2019, OpenAI (https://openai.com/, accessed on 25 March 2023) released GPT-2 [15], an improved version of the original GPT model. GPT-2 has been trained on an even larger corpus of text data than its predecessor, making it one of the most powerful language models currently available. Similar to GPT, GPT-2 uses a transformer-based architecture, which allows it to learn contextualized representations of words and sentences. GPT-2 employs a novel unsupervised learning approach in its training method, wherein the model predicts the subsequent word in a text sequence. This training approach enables GPT-2 to produce high-quality text that is typically indistinguishable from human-generated texts.

3. Multimodal Explanation Generation Model Based on Multiple Reference Information

We present an overview of our model in Figure 2. We propose a model that simultaneously generates explanations and answers input questions. This approach ensures that the generated explanation is consistent with the question-answering process of the model, resulting in a more coherent and interpretable output.
We incorporate multiple reference sources in addition to the input image to improve the accuracy of the generated answers and the rationality of the generated explanations. The n-th instance of the training data consists of an input image $I_n$ and a corresponding question $Q_n$. The index n ranges from 1 to N, where N represents the total number of training data points. Our model generates an output sentence $W_n = \{w_1, \ldots, w_j, \ldots, w_J\}$, where $w_j$ denotes the j-th word in the sentence. The output sentence contains the answer and explanation of the question, connected by the word “because.”
To extract features from the input image, we first crop it into small pieces of size $N_{\mathrm{grid}} \times N_{\mathrm{grid}}$, denoted as $I_n = \{i_1, i_2, \ldots, i_{N_{\mathrm{grid}} \times N_{\mathrm{grid}}}\}$. Then, we use a transformer-based vision encoder, denoted as $E_V$, to extract features from each image piece. The image feature $f_n^I$ can be calculated as follows:

$$f_n^I = E_V(\{i_1, i_2, \ldots, i_{N_{\mathrm{grid}} \times N_{\mathrm{grid}}}\}). \qquad (1)$$

By utilizing the attention mechanism in the encoder $E_V(\cdot)$, we can obtain image features $f_n^I$ that represent the interrelationships between different regions of the image.
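As an illustration, the following minimal sketch shows how such grid features could be extracted with the CLIP vision encoder that we adopt in Section 4.1; the checkpoint name and the Hugging Face transformers API calls are assumptions for illustration rather than our exact implementation.

```python
import torch
from PIL import Image
from transformers import CLIPImageProcessor, CLIPVisionModel

# Assumed checkpoint; any CLIP vision encoder producing patch-level (grid) tokens works similarly.
processor = CLIPImageProcessor.from_pretrained("openai/clip-vit-base-patch32")
vision_encoder = CLIPVisionModel.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("example.jpg")
inputs = processor(images=image, return_tensors="pt")

with torch.no_grad():
    outputs = vision_encoder(**inputs)

# last_hidden_state: (1, 1 + N_grid * N_grid, hidden); the first token is a global token,
# the remaining tokens correspond to the image pieces i_1, ..., i_{N_grid x N_grid}.
f_I = outputs.last_hidden_state[:, 1:, :]   # image feature f_n^I (Equation (1))
```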

3.1. Generation of Multiple References

In contrast to previous models [17,18,19], which refer only to image information, we introduce multiple references (multiple image captions, outside knowledge, and object information) to improve the accuracy and rationality of the generated explanations. We use pre-trained models to generate this information, and all three types of reference information are in a text format. To bridge the gap between modalities in this multimodal model, we use the same language encoder $E_L(\cdot)$ to encode all textual content.

3.1.1. Generation of Multiple Image Captions

As captions generated by a single model contain limited information, we use multiple image-captioning models pre-trained on different datasets, denoted as $G_C(\cdot)$, to generate multiple captions $C_n = \{c_n^1, \ldots, c_n^h, \ldots, c_n^H\}$ (H represents the total number of captions in the set) from the input image $I_n$. We use $E_L(\cdot)$ to extract the feature $f_h^c$ of each caption $c_n^h$, and the extracted features are summed to obtain the final caption feature $f_n^C$ as follows:

$$f_n^C = f_1^c + \cdots + f_H^c. \qquad (2)$$

By using $f_n^C$ in the final generation process, we introduce captions containing multiple types of information into the model as a reference. This approach allows the model to use information from different modalities to complement the visual information provided by the image.
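The sketch below illustrates this step with two publicly available captioning models standing in for the five models used in Section 4.1 and with the CLIP text encoder standing in for $E_L(\cdot)$; the checkpoint names and API calls are illustrative assumptions.

```python
import torch
from transformers import pipeline, CLIPTokenizer, CLIPTextModel

# Two captioners stand in for the five models used in the paper (GIT, BLIP, CoCa, BLIP-2);
# the checkpoint names are assumptions for illustration.
captioners = [
    pipeline("image-to-text", model="microsoft/git-large-coco"),
    pipeline("image-to-text", model="Salesforce/blip-image-captioning-large"),
]

# CLIP text encoder as a stand-in for the language encoder E_L.
tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")
text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-base-patch32")

def encode_text(text: str) -> torch.Tensor:
    tokens = tokenizer(text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        return text_encoder(**tokens).pooler_output        # (1, hidden) sentence feature

def caption_feature(image_path: str) -> torch.Tensor:
    # Generate one caption per model, encode each caption, and sum the features (Equation (2)).
    captions = [cap(image_path)[0]["generated_text"] for cap in captioners]
    features = [encode_text(c) for c in captions]
    return torch.stack(features).sum(dim=0)                 # caption feature f_n^C
```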

3.1.2. Generation of Outside Knowledge

Outside knowledge related to images can be used as a reference for generating explanations, whereas knowledge related to questions can help the model understand them. Specifically, we use a retrieval model $R(\cdot,\cdot)$ to retrieve the most relevant knowledge items $K_n$ from a collection of knowledge items denoted as $\mathcal{K}$, based on the cosine similarity between the extracted feature vectors. Since the outside knowledge set $\mathcal{K}$ contains nearly 200,000 items and the training set contains nearly 30,000 data items, the total number of similarity computations is close to 6 billion. Therefore, we chose a retrieval model $R(\cdot,\cdot)$ that can use a GPU to accelerate cosine similarity matching, which greatly improves the efficiency of outside knowledge retrieval while maintaining retrieval accuracy.
We use the language encoder $E_L(\cdot)$ to encode the input question $Q_n$ and the outside knowledge base $\mathcal{K}$, respectively. The extracted features are then used in conjunction with the previously extracted image feature $f_n^I$ to retrieve question- and image-based outside knowledge items using $R(\cdot,\cdot)$. The retrieved outside knowledge items are stored in an outside knowledge set $K_n = \{k_n^1, \ldots, k_n^m, \ldots, k_n^M\}$ (M represents the total number of retrieved outside knowledge items). Each knowledge item in $K_n$ is a description of an object consisting of one or several sentences, and all of the items are taken from Wikipedia. This set contains knowledge items retrieved using both the question and the image. The retrieval process can be represented as follows:

$$K_n = \{R(E_L(Q_n), E_L(\mathcal{K})), R(f_n^I, E_L(\mathcal{K}))\}. \qquad (3)$$

We use an approach similar to that used for image captioning to incorporate outside knowledge into our final prediction process. Specifically, we extract the feature vector $f_n^K$ from the outside knowledge items using $E_L(\cdot)$, as follows:

$$f_n^K = E_L(k_n^1) + \cdots + E_L(k_n^M). \qquad (4)$$

The extracted feature vector of outside knowledge contains external information about the image. The outside knowledge feature vector $f_n^K$ is then merged with the other extracted feature vectors during the final prediction process.
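The following minimal sketch illustrates this retrieval step with Faiss [47], which we use in our experiments (Section 4.1); L2-normalizing the feature vectors makes the inner-product index equivalent to cosine-similarity search. The feature dimension and the random query vectors are placeholders for illustration.

```python
import faiss
import numpy as np

def build_knowledge_index(knowledge_feats: np.ndarray) -> faiss.Index:
    # knowledge_feats: (num_items, dim) features of the knowledge base K encoded by E_L.
    feats = np.ascontiguousarray(knowledge_feats, dtype="float32")
    faiss.normalize_L2(feats)                  # after L2 normalization, inner product == cosine similarity
    index = faiss.IndexFlatIP(feats.shape[1])  # exact inner-product index (a GPU index can also be used)
    index.add(feats)
    return index

def retrieve_top_k(index: faiss.Index, query_feat: np.ndarray, top_k: int = 3) -> np.ndarray:
    query = np.ascontiguousarray(query_feat.reshape(1, -1), dtype="float32")
    faiss.normalize_L2(query)
    _, ids = index.search(query, top_k)        # ids of the top_k most similar knowledge items
    return ids[0]

# K_n is formed from the items retrieved with the question feature E_L(Q_n) and with the
# image feature f_n^I (Equation (3)); here the features are simulated with random vectors.
rng = np.random.default_rng(0)
index = build_knowledge_index(rng.standard_normal((1000, 512)))
question_ids = retrieve_top_k(index, rng.standard_normal(512))
image_ids = retrieve_top_k(index, rng.standard_normal(512))
```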

3.1.3. Generation of Object Information

The object information in the image can be directly used as a reference for the answer and can also be used to generate a more reasonable explanation. We introduce an object detection model $G_O(\cdot)$ to detect the presence and number of objects in the image. We finally obtain the detected objects and their numbers of occurrences in the image as follows:

$$\{o_1: num_{o_1}, \ldots, o_p: num_{o_p}, \ldots, o_P: num_{o_P}\}, \qquad (5)$$

where $o_p$ represents the p-th detected object, and $num_{o_p}$ represents its number of occurrences in the image. We represent the object information as text to be consistent with the other reference information and to reduce the difficulty of model training. We use the language encoder to extract high-dimensional features from this textual information and use them as a reference for the final model generation. We describe the condition of the detected objects in the image using a sentence $O_n$ as follows:

“There is/are $num_{o_1}$ $o_1$, ..., $num_{o_p}$ $o_p$, ..., and $num_{o_P}$ $o_P$ in this image.” (6)

When no object is detected in the image, we use “Although nothing was detected, there could be something” as $O_n$ to describe the condition of the objects in the image, considering that the object detection model can detect only a limited number of object categories. We encode $O_n$ using the language encoder $E_L(\cdot)$ and obtain the feature $f_n^O$, which preserves the information about the objects in the image.
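A minimal sketch of converting detections into the sentence $O_n$ is given below, using YOLOv5 [48] loaded through torch.hub as in our experiments; the model size (“yolov5s”) and the confidence threshold are illustrative assumptions.

```python
from collections import Counter
import torch

# YOLOv5 loaded through torch.hub; "yolov5s" and the 0.5 confidence threshold are assumptions.
detector = torch.hub.load("ultralytics/yolov5", "yolov5s", pretrained=True)

def object_sentence(image_path: str, conf_threshold: float = 0.5) -> str:
    results = detector(image_path)
    detections = results.pandas().xyxy[0]                        # one row per detected object
    names = detections[detections["confidence"] >= conf_threshold]["name"].tolist()
    if not names:
        # Fallback sentence used when nothing is detected.
        return "Although nothing was detected, there could be something"
    counts = Counter(names)
    parts = [f"{num} {obj}" for obj, num in counts.items()]
    if len(parts) > 1:
        parts[-1] = "and " + parts[-1]
    verb = "is" if sum(counts.values()) == 1 else "are"
    return f"There {verb} {', '.join(parts)} in this image."    # the sentence O_n (Equation (6))
```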

3.2. Answer and Explanation Generation

In multimodal tasks, feature fusion is crucial to achieving optimal performance [39]. We use a joint vector that combines the image and language features to expand the information available in our model. We use vision and language encoders that map these features to a shared latent space to reduce field conflicts.
We use a simple and efficient feature fusion approach to balance computational cost and fusion quality. We employ MLPs to process the extracted feature vectors and concatenate the processed vectors into a joint vector. The joint vector contains the reference information carried by all of the features and is used to generate the answers and corresponding explanations. Specifically, two different MLPs, denoted by $g_I$ and $g_L$, are used to process the image feature $f_n^I$ and the language features $f_n^C$, $f_n^K$, and $f_n^O$, respectively. The MLPs ensure the consistency of the vector dimensions across modalities and address the field conflict problem. Subsequently, we use a concatenation operation to generate the joint vector $f_n^J$. However, the image feature vector has three dimensions, while the language feature vectors have two dimensions. Thus, we apply the $\mathrm{unsqueeze}(\cdot)$ operation before concatenation to solve the dimension mismatch problem, as follows:

$$f_n^J = \mathrm{concatenate}(g_I(f_n^I), \mathrm{unsqueeze}(g_L(f_n^C)), \mathrm{unsqueeze}(g_L(f_n^K)), \mathrm{unsqueeze}(g_L(f_n^O))). \qquad (7)$$

Since we use concatenation to connect the previously extracted feature vectors, our model combines feature vectors from different modalities while preserving as much information as possible. Thus, the joint vector $f_n^J$ contains all of the reference information.
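A minimal PyTorch sketch of this fusion step is shown below. The hidden-layer sizes (512 and 4096) follow the settings in Section 4.1, but the exact depth and activation of the MLPs, as well as the feature dimensions in the usage example, are illustrative assumptions.

```python
import torch
import torch.nn as nn

class FeatureFusion(nn.Module):
    """Builds the joint vector f_n^J from the image feature and the three language features."""
    def __init__(self, img_dim: int, lang_dim: int, joint_dim: int):
        super().__init__()
        # g_I and g_L: two MLPs mapping image / language features to the shared joint dimension.
        self.g_img = nn.Sequential(nn.Linear(img_dim, 512), nn.ReLU(), nn.Linear(512, joint_dim))
        self.g_lang = nn.Sequential(nn.Linear(lang_dim, 4096), nn.ReLU(), nn.Linear(4096, joint_dim))

    def forward(self, f_img, f_cap, f_know, f_obj):
        # f_img: (B, T_img, img_dim) grid features; the language features are (B, lang_dim).
        img = self.g_img(f_img)                               # (B, T_img, joint_dim)
        cap = self.g_lang(f_cap).unsqueeze(1)                 # (B, 1, joint_dim)
        know = self.g_lang(f_know).unsqueeze(1)
        obj = self.g_lang(f_obj).unsqueeze(1)
        return torch.cat([img, cap, know, obj], dim=1)        # joint vector f_n^J (Equation (7))

# Usage example with illustrative dimensions.
fusion = FeatureFusion(img_dim=768, lang_dim=512, joint_dim=768)
f_joint = fusion(torch.randn(2, 49, 768), torch.randn(2, 512), torch.randn(2, 512), torch.randn(2, 512))
```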
Our model generates a sentence $W_n$ containing the answer to the question $Q_n$ and the corresponding explanation. This sentence is generated by the decoder $D$, which decodes the input feature $f_n^J$ together with $Q_n$. Specifically, the context in $W_n$ is constructed as {question} + {answer} + “because” + {explanation}, which can be calculated as follows:

$$W_n = D(f_n^J, Q_n). \qquad (8)$$

The generated answers and explanations are decoded from the joint vector $f_n^J$ containing all of the reference information; thus, the generation process refers to diverse information from different modalities. Our model is trained using the cross-entropy loss to minimize the negative log-likelihood, which can be computed as follows:

$$\mathcal{L} = -\sum_{j=1}^{J} \log p_\theta(w_j \mid w_{<j}). \qquad (9)$$

This is the sum of the negative log probabilities of each word in the generated sentence $W_n$, given the words before it. Specifically, we denote the words before the j-th word as $w_{<j}$ and $p_\theta(\cdot)$ as the probability mass function parameterized by $\theta$, the model parameters, which we optimize using the loss $\mathcal{L}$. Training the model with this loss ensures that it generates accurate explanations close to the ground truth.
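The loss in Equation (9) corresponds to standard token-level cross-entropy under teacher forcing; the following sketch (which assumes padded batches and a shifted target sequence) illustrates one way it could be computed.

```python
import torch
import torch.nn.functional as F

def explanation_nll_loss(logits: torch.Tensor, target_ids: torch.Tensor, pad_id: int) -> torch.Tensor:
    """
    Autoregressive negative log-likelihood over the generated sentence W_n (Equation (9)).
    logits:     (B, J, vocab_size) decoder outputs
    target_ids: (B, J)             token ids of the ground-truth sentence
    """
    # Shift so that the prediction at each position is scored against the next token w_j given w_{<j}.
    shifted_logits = logits[:, :-1, :].contiguous()
    shifted_targets = target_ids[:, 1:].contiguous()
    return F.cross_entropy(
        shifted_logits.view(-1, shifted_logits.size(-1)),
        shifted_targets.view(-1),
        ignore_index=pad_id,   # padding tokens do not contribute to the loss
    )

# Usage example with random tensors (batch of 2, sentence length 8, vocabulary of 100 tokens).
loss = explanation_nll_loss(torch.randn(2, 8, 100), torch.randint(0, 100, (2, 8)), pad_id=0)
```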

4. Experiments

4.1. Experimental Settings

For our experiments, we used the VQA-X dataset [13] and the e-SNLI-VE dataset [18]. The VQA-X dataset is an extension of the VQA-v2 dataset [4]. As depicted in Figure 3, the VQA-X dataset includes explanations for each answer and is a multimodal dataset that combines vision and language. The VQA-X dataset consists of 33,000 question–answer (QA) pairs for more than 28,000 images, all of which were obtained from the COCO2014 dataset [40]. Specifically, we used 24,876 images from the COCO2014 training set, which contained 29,459 QA pairs, as our training set. To create our validation and test sets, we partitioned the COCO2014 validation set in a 3:4 ratio, resulting in approximately 1500 and 2000 QA pairs, respectively. The e-SNLI-VE dataset consists of more than 30,000 images from the Flickr30k dataset [41], and each image has several corresponding QA pairs. The training and test sets have 401,717 QA pairs and 14,740 QA pairs, respectively.
Our main reason for selecting vision and language encoders is their efficiency in extracting features and matching features across modalities. We used the vision encoder of CLIP to extract simple grid features. Traditional vision models are typically designed for specific tasks such as image classification and segmentation, whereas the approach of CLIP to the multimodal task of image and text matching allows its vision and language encoders to be used simultaneously.
We employed five different image-captioning models: GIT-large [42] fine-tuned on COCO, GIT-large [42] fine-tuned on TextCaps, BLIP [43], CoCa [44], and BLIP-2 [45], to generate multiple image captions. In Appendix A, we present examples of generated captions in Figure A1. Our results demonstrate that different image-captioning models can frequently provide complementary information. Incorporating multiple captions enables our model to utilize a wider range of information.
A knowledge base containing various types of knowledge information is required to retrieve outside knowledge useful for answering questions and generating explanations. Wikipedia (https://www.wikipedia.org/, accessed on 25 March 2023) has emerged as a valuable resource for various applications due to its structured data collection. As depicted in Figure 4, our questions focus on topics such as daily life, sports, and animals.
In our experiments, we chose a recently proposed subset from Wikipedia that follows the methodology of KAT [46] as the outside knowledge set K of our work. All of the experiments in this study use the same outside knowledge set. For outside knowledge retrieval, we used Faiss [47] to calculate the cosine similarity between vectors to complete the retrieval.
We used YOLOv5 [48] to introduce object information from the image. We also show samples of object information in our experiments in Figure A3 in Appendix A. As the figure shows, YOLOv5 performs well in the dataset’s most common scenes, detecting a majority of valuable objects in the images. For our image and language MLPs, we set the hidden layer’s size to 512 and 4096, respectively. Furthermore, we set the number of captions in our caption set, denoted by H, and the number of items in our outside knowledge set, denoted by M, to 5 and 3, respectively.
All models were trained for 30 epochs to ensure convergence. All images were resized to 224 × 224 pixels and randomly flipped to prevent overfitting. The learning rate was initially set to $2 \times 10^{-5}$ and decreased gradually to $10^{-5}$, and the batch size was fixed at 32. The parameter settings of all experiments are consistent with the papers and published source code of the comparison models [17,18,19], without any adjustment, to ensure the fairness of the comparison experiments. We used Accelerate (https://github.com/huggingface/accelerate, accessed on 25 March 2023) in the training process. The proposed models were trained on a computer with an AMD (Advanced Micro Devices; Santa Clara, CA, USA) EPYC 7713P 64-core/128-thread processor, 512 GB of RAM, and an NVIDIA Corporation (Santa Clara, CA, USA) GA100 graphics card, with a total training time of 9 h.
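For illustration, a simplified sketch of such a training loop with Accelerate is shown below; the dummy model and data stand in for the actual explanation model and the VQA-X loader, and the linear schedule is one assumed way to realize the described learning-rate decay from $2 \times 10^{-5}$ to $10^{-5}$.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset
from accelerate import Accelerator

# Dummy stand-ins so the sketch runs; in practice these are the explanation model and the VQA-X loader.
model = nn.Linear(16, 2)
loader = DataLoader(TensorDataset(torch.randn(64, 16), torch.randint(0, 2, (64,))), batch_size=32)

accelerator = Accelerator()
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)           # initial learning rate 2e-5
num_steps = 30 * len(loader)
scheduler = torch.optim.lr_scheduler.LinearLR(                       # assumed linear decay 2e-5 -> 1e-5
    optimizer, start_factor=1.0, end_factor=0.5, total_iters=num_steps)
model, optimizer, loader, scheduler = accelerator.prepare(model, optimizer, loader, scheduler)

loss_fn = nn.CrossEntropyLoss()
for epoch in range(30):                                              # 30 training epochs
    for x, y in loader:
        loss = loss_fn(model(x), y)
        accelerator.backward(loss)
        optimizer.step()
        scheduler.step()
        optimizer.zero_grad()
```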

4.2. Evaluation

Table 2 shows the details of the proposed methods (PMs) and the four state-of-the-art natural language explanation methods, referred to as the comparative methods (CMs), used in our experiments. The CMs answer questions and generate natural language explanations based solely on the input image.
The following common language modeling evaluation metrics were used to evaluate the generated explanations: BLEU-n (n = 1 to 4) [49], METEOR [50], ROUGE [51], SPICE [52], and CIDEr [53]. Meanwhile, we used accuracy to evaluate the generated answers. Finally, we used the publicly available project provided by Chen et al. [54] to compute all language metric scores.
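For reference, the sketch below shows how these metrics could be computed with the pycocoevalcap package, a pip-installable port of the evaluation code released by Chen et al. [54]; the reference and candidate sentences are made-up examples, and METEOR and SPICE additionally require a Java runtime.

```python
from pycocoevalcap.bleu.bleu import Bleu
from pycocoevalcap.meteor.meteor import Meteor
from pycocoevalcap.rouge.rouge import Rouge
from pycocoevalcap.cider.cider import Cider
from pycocoevalcap.spice.spice import Spice

# Both dicts map an example id to a list of sentences (pre-tokenized, lower-cased strings);
# gts holds the ground-truth explanations and res holds the generated ones.
gts = {"0": ["because he is wearing a wetsuit and holding a surfboard"]}
res = {"0": ["because he is holding a surfboard on the beach"]}

scorers = [
    (Bleu(4), ["BLEU1", "BLEU2", "BLEU3", "BLEU4"]),
    (Meteor(), "METEOR"),
    (Rouge(), "ROUGE-L"),
    (Cider(), "CIDEr"),
    (Spice(), "SPICE"),
]

for scorer, name in scorers:
    score, _ = scorer.compute_score(gts, res)
    print(name, score)   # Bleu returns a list of four scores, the others a single float
```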

4.3. Experimental Results

4.3.1. Quantitative Analysis

As presented in Table 3 and Table 4, the language metric results on the VQA-X and e-SNLI-VE dataset prove that our model outperforms the CMs by utilizing multiple sources of information. This shows that referring to multiple sources of information in addition to images can effectively improve the model’s understanding of the questions and images, thereby improving the correctness of the generated answers and the plausibility of the generated explanations.
As presented in Table 5, we also conducted several ablation experiments on the VQA-X dataset to evaluate the impact of introducing image captions, outside knowledge, and object information, respectively. The results in Table 5 show that introducing reference information alone outperforms the model with only image references. However, the best performance is achieved when all three reference contents are introduced simultaneously.
In feature fusion, our previous study [26] encoded each piece of reference information separately using MLPs with small hidden sizes. When the variety of reference information is small, multiple MLPs provide the model with a richer learning space and can fuse features well. However, this approach becomes inefficient as the variety of reference information increases, because the features encoded by a single language encoder already lie in one space; in that case, too many MLPs can cause learning difficulties. According to the results shown in Table 5, using four MLPs to separately process different reference information yields worse results than using two MLPs.

4.3.2. Qualitative Analysis

Figure 5 qualitatively compares the proposed method and CM4, where the key information contained in the generated multiple references is marked in purple. The results show that our method correctly answers questions that the CMs cannot due to the use of additional reference information. The experimental results demonstrate that the three reference contents may not always contain useful information. However, there will always be references that contain useful information for answering questions and generating explanations, which proves that diverse reference contents can effectively improve the model’s performance by complementing each other in different scenarios.

5. Discussions

We propose a natural language explanation generation model to solve the problem of insufficient information caused by relying on image input only. Our approach leverages multiple references and employs an efficient feature fusion technique, resulting in superior question answering and explanation generation performance compared with state-of-the-art methods. Since our research can help lay people judge the reliability of AI system judgments, it has a wide range of potential applications in areas such as paramedicine and fault detection. Beyond these applications, our research can also be used to analyze the performance of current models and clarify the direction of model improvement by observing the reasons behind wrong judgments.

5.1. Limitations

Our model demonstrates robust performance in both qualitative and quantitative evaluations, highlighting the effectiveness of leveraging multiple reference sources. However, introducing information through textual descriptions alone is still inefficient. Objects in images and outside knowledge have complex hierarchical relationships, and representing them through text alone can lead to the loss of these relationships. As a result, the model may struggle to accurately answer subjective or detail-oriented questions and to generate reasonable explanations. In addition, the outside knowledge retrieval approach, which relies on cosine similarity, has limitations: it can retrieve knowledge that is irrelevant or only weakly relevant. Furthermore, we simply used a cross-entropy loss to train the entire model. Considering that feature fusion is a very important part of this study, not constraining the training of the MLPs with a dedicated loss may limit the performance of the model.
The large-scale language model used in this study also has its limitations [55]. First, training large-scale language models requires enormous computational resources, which limits their development and makes them difficult to deploy on some real-life platforms. Second, large-scale language models are complex and difficult to reproduce, making it challenging to verify or validate their performance or to investigate issues related to their operation [56].

5.2. Future Works

In future work, we will aim to improve the performance of the model in two ways. First, we plan to improve the utilization of reference information by exploring the use of trees or graphs to construct links to extra-image information, as proposed by Daniel et al. [57], Chen et al. [58], and Hu et al. [22]. This approach would allow the model to make better use of additional information by incorporating the predictions of prior distributions into the model. In recent years, the development of LLMs, such as ChatGPT [59], has opened up new possibilities for retrieving outside knowledge. Our future work will go beyond simply using vector similarity to retrieve knowledge from Wikipedia and will also retrieve outside knowledge from LLMs. Second, we will aim to develop a personalized [60] question-and-answer model. Rudovic et al. [61] proposed different models for different age groups of children. We will vary the outside knowledge to simulate different ages, regions, and educational backgrounds to enable the model to produce outputs for different groups when answering the same question.

6. Conclusions

The proposed method has demonstrated high accuracy in generating explanations by integrating multiple reference sources. Our approach involves using multiple image-captioning models to generate multiple image captions, retrieving relevant outside knowledge, and detecting objects using an object detection model. By exploiting multiple references, our approach can address the problem of informativeness. Furthermore, our novel feature fusion method improves the efficiency of feature fusion in multimodal tasks. The experimental results show that the proposed method outperforms several state-of-the-art methods for question-answering tasks and explanation generation tasks in both qualitative and quantitative evaluations.

Author Contributions

Conceptualization: H.Z., R.T. and T.O.; methodology: H.Z.; data curation: H.Z.; writing—original draft: H.Z., R.T. and T.O.; writing—review and editing: R.T., T.O. and M.H.; supervision: M.H.; funding acquisition: T.O. All authors have read and agreed to the published version of the manuscript.

Funding

This study was partly supported by JSPS KAKENHI, grant numbers JP21H03456 and JP20K19857.

Data Availability Statement

Not applicable.

Conflicts of Interest

The authors declare no conflict of interest.

Appendix A

We show samples of the generated image caption, retrieved external knowledge, and object information from multiple references in Figure A1, Figure A2 and Figure A3, respectively.
Figure A1. Samples of the multiple captions. The different models can complement each other to describe the information within the image as completely as possible.
Figure A2. Samples of the retrieved outside knowledge. Keywords that are of reference value for answering questions and generating explanations are highlighted in green.
Figure A3. Samples of object information. In the three most frequent scenarios in the dataset, the object detection model can effectively detect the key objects in the images.

References

  1. Makridakis, S. The forthcoming Artificial Intelligence (AI) revolution: Its impact on society and firms. Futures 2017, 90, 46–60. [Google Scholar] [CrossRef]
  2. Kang, J.S.; Kang, J.; Kim, J.J.; Jeon, K.W.; Chung, H.J.; Park, B.H. Neural Architecture Search Survey: A Computer Vision Perspective. Sensors 2023, 23, 1713. [Google Scholar] [CrossRef] [PubMed]
  3. Khurana, D.; Koli, A.; Khatter, K.; Singh, S. Natural language processing: State of the art, current trends and challenges. Multimed. Tools Appl. 2023, 82, 3713–3744. [Google Scholar] [CrossRef]
  4. Antol, S.; Agrawal, A.; Lu, J.; Mitchell, M.; Batra, D.; Zitnick, C.L.; Parikh, D. VQA: Visual question answering. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015; pp. 2425–2433. [Google Scholar] [CrossRef]
  5. Czimmermann, T.; Ciuti, G.; Milazzo, M.; Chiurazzi, M.; Roccella, S.; Oddo, C.M.; Dario, P. Visual-based defect detection and classification approaches for industrial applications—A survey. Sensors 2020, 20, 1459. [Google Scholar] [CrossRef] [PubMed]
  6. Dhar, T.; Dey, N.; Borra, S.; Sherratt, R.S. Challenges of Deep Learning in Medical Image Analysis—Improving Explainability and Trust. IEEE Trans. Technol. Soc. 2023, 4, 68–75. [Google Scholar] [CrossRef]
  7. Zhu, H.; Togo, R.; Ogawa, T.; Haseyama, M. Diversity Learning Based on Multi-Latent Space for Medical Image Visual Question Generation. Sensors 2023, 23, 1057. [Google Scholar] [CrossRef]
  8. Huang, X. Safety and Reliability of Deep Learning: (Brief Overview). In Proceedings of the 1st International Workshop on Verification of Autonomous & Robotic Systems, Philadelphia, PA, USA, 23 May 2021. [Google Scholar] [CrossRef]
  9. Molnar, C. Interpretable Machine Learning: A Guide for Making Black Box Models Explainable; Lean Publishing: Victoria, BC, Canada, 2020. [Google Scholar]
  10. Lipton, Z.C. The mythos of model interpretability: In machine learning, the concept of interpretability is both important and slippery. Queue 2018, 61, 31–57. [Google Scholar] [CrossRef]
  11. Berg, T.; Belhumeur, P.N. How Do You Tell a Blackbird from a Crow? In Proceedings of the IEEE International Conference on Computer Vision, Sydney, Australia, 1–8 December 2013. [Google Scholar]
  12. Doersch, C.; Singh, S.; Gupta, A.; Sivic, J.; Efros, A. What makes paris look like paris? ACM Trans. Graph. 2012, 31, 101. [Google Scholar] [CrossRef]
  13. Park, D.H.; Hendricks, L.A.; Akata, Z.; Rohrbach, A.; Schiele, B.; Darrell, T.; Rohrbach, M. Multimodal explanations: Justifying decisions and pointing to the evidence. In Proceedings of the IEEE/CVF Conference on Conference Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 8779–8788. [Google Scholar] [CrossRef]
  14. Wu, J.; Mooney, R. Faithful Multimodal Explanation for Visual Question Answering. In Proceedings of the ACL Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, Florence, Italy, 1 August 2019; pp. 103–112. [Google Scholar] [CrossRef]
  15. Radford, A.; Wu, J.; Child, R.; Luan, D.; Amodei, D.; Sutskever, I. Language models are unsupervised multitask learners. OpenAI Blog 2019, 1, 9. [Google Scholar]
  16. Radford, A.; Kim, J.W.; Hallacy, C.; Ramesh, A.; Goh, G.; Agarwal, S.; Sastry, G.; Askell, A.; Mishkin, P.; Clark, J.; et al. Learning transferable visual models from natural language supervision. In Proceedings of the International Conference on Machine Learning, Online, 18–24 July 2021; pp. 8748–8763. [Google Scholar] [CrossRef]
  17. Marasović, A.; Bhagavatula, C.; sung Park, J.; Le Bras, R.; Smith, N.A.; Choi, Y. Natural Language Rationales with Full-Stack Visual Reasoning: From Pixels to Semantic Frames to Commonsense Graphs. In Proceedings of the Findings of the Association for Computational Linguistics: EMNLP, Online, 16–20 November 2020; pp. 2810–2829. [Google Scholar] [CrossRef]
  18. Kayser, M.; Camburu, O.M.; Salewski, L.; Emde, C.; Do, V.; Akata, Z.; Lukasiewicz, T. E-ViL: A dataset and benchmark for natural language explanations in vision-language tasks. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, BC, Canada, 11–17 October 2021; pp. 1244–1254. [Google Scholar] [CrossRef]
  19. Sammani, F.; Mukherjee, T.; Deligiannis, N. NLX-GPT: A Model for Natural Language Explanations in Vision and Vision-Language Tasks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 8322–8332. [Google Scholar] [CrossRef]
  20. Li, X.; Zhou, Y.; Wu, T.; Socher, R.; Xiong, C. Learn to Grow: A Continual Structure Learning Framework for Overcoming Catastrophic Forgetting. Proc. Mach. Learn. Res. 2019, 97, 3925–3934. [Google Scholar]
  21. Zhu, H.; Togo, R.; Ogawa, T.; Haseyama, M. A multimodal interpretable visual question answering model introducing image caption processor. In Proceedings of the IEEE 11th Global Conference on Consumer Electronics, Osaka, Japan, 18–21 October 2022; pp. 805–806. [Google Scholar] [CrossRef]
  22. Hu, X.; Gu, L.; Kobayashi, K.; An, Q.; Chen, Q.; Lu, Z.; Su, C.; Harada, T.; Zhu, Y. Interpretable Medical Image Visual Question Answering via Multi-Modal Relationship Graph Learning. arXiv 2023, arXiv:2302.09636. [Google Scholar]
  23. Zhao, Z.Q.; Zheng, P.; Xu, S.T.; Wu, X. Object detection with deep learning: A review. IEEE Trans. Neural Networks Learn. Syst. 2019, 30, 3212–3232. [Google Scholar] [CrossRef] [PubMed]
  24. Marino, K.; Rastegari, M.; Farhadi, A.; Mottaghi, R. Ok-VQA: A visual question answering benchmark requiring external knowledge. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 3195–3204. [Google Scholar]
  25. Vrandečić, D.; Krötzsch, M. Wikidata: A free collaborative knowledgebase. Commun. ACM 2014, 57, 78–85. [Google Scholar] [CrossRef]
  26. Zhu, H.; Togo, R.; Ogawa, T.; Haseyama, M. Interpretable Visual Question Answering Referring to Outside Knowledge. arXiv 2023, arXiv:2303.04388. [Google Scholar]
  27. Ben-Younes, H.; Cadene, R.; Thome, N.; Cord, M. BLOCK: Bilinear superdiagonal fusion for visual question answering and visual relationship detection. Proc. AAAI Conf. Artif. Intell. 2019, 33, 8102–8109. [Google Scholar] [CrossRef]
  28. Jiang, H.; Misra, I.; Rohrbach, M.; Learned-Miller, E.; Chen, X. In defense of grid features for visual question answering. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 10267–10276. [Google Scholar] [CrossRef]
  29. Li, X.; Song, J.; Gao, L.; Liu, X.; Huang, W.; He, X.; Gan, C. Beyond RNNs: Positional self-attention with co-attention for video question answering. In Proceedings of the AAAI Conference on Artificial Intelligence, Honolulu, HI, USA, 27 January–1 February 2019; Volume 33, pp. 8658–8665. [Google Scholar] [CrossRef]
  30. Tang, R.; Ma, C.; Zhang, W.E.; Wu, Q.; Yang, X. Semantic equivalent adversarial data augmentation for visual question answering. In Proceedings of the European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020; pp. 437–453. [Google Scholar]
  31. Escorcia, V.; Carlos Niebles, J.; Ghanem, B. On the relationship between visual attributes and convolutional networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 1256–1264. [Google Scholar] [CrossRef]
  32. Zeiler, M.D.; Fergus, R. Visualizing and understanding convolutional networks. In Proceedings of the European Conference on Computer Vision, Zurich, Switzerland, 6–12 September 2014; pp. 818–833. [Google Scholar]
  33. Zhou, B.; Khosla, A.; Lapedriza, A.; Oliva, A.; Torralba, A. Object detectors emerge in Deep Scene CNNs. arXiv 2014, arXiv:1412.6856. [Google Scholar] [CrossRef]
  34. Liu, P.; Yuan, W.; Fu, J.; Jiang, Z.; Hayashi, H.; Neubig, G. Pre-Train, Prompt, and Predict: A Systematic Survey of Prompting Methods in Natural Language Processing. ACM Comput. Surv. 2023, 55, 1–35. [Google Scholar] [CrossRef]
  35. Allen, J. Natural Language Understanding; Benjamin-Cummings Publishing Co., Inc.: Menlo Park, CA, USA, 1995. [Google Scholar]
  36. Reiter, E.; Dale, R. Building applied natural language generation systems. Nat. Lang. Eng. 1997, 3, 57–87. [Google Scholar] [CrossRef]
  37. Ranathunga, S.; Lee, E.S.A.; Prifti Skenduli, M.; Shekhar, R.; Alam, M.; Kaur, R. Neural machine translation for low-resource languages: A survey. ACM Comput. Surv. 2023, 55, 1–37. [Google Scholar] [CrossRef]
  38. Radford, A.; Narasimhan, K.; Salimans, T.; Sutskever, I. Improving Language Understanding by Generative Pre-Training. Technical Report, OpenAI. 2018. Available online: https://openai.com/ (accessed on 25 March 2023).
  39. Dai, Y.; Gieseke, F.; Oehmcke, S.; Wu, Y.; Barnard, K. Attentional feature fusion. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Waikola, HI, USA, 3–8 January 2021; pp. 3560–3569. [Google Scholar] [CrossRef]
  40. Lin, T.Y.; Maire, M.; Belongie, S.; Hays, J.; Perona, P.; Ramanan, D.; Dollár, P.; Zitnick, C.L. Microsoft COCO: Common objects in context. In Proceedings of the European Conference on Computer Vision, Zurich, Switzerland, 6–12 September 2014; pp. 740–755. [Google Scholar]
  41. Young, P.; Lai, A.; Hodosh, M.; Hockenmaier, J. From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions. Trans. Assoc. Comput. Linguist. 2014, 2, 67–78. [Google Scholar] [CrossRef]
  42. Wang, J.; Yang, Z.; Hu, X.; Li, L.; Lin, K.; Gan, Z.; Liu, Z.; Liu, C.; Wang, L. GIT: A Generative Image-to-text Transformer for Vision and Language. arXiv 2022, arXiv:2205.14100. [Google Scholar] [CrossRef]
  43. Li, J.; Li, D.; Xiong, C.; Hoi, S. BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In Proceedings of the International Conference on Machine Learning, PMLR, Baltimore, MA, USA, 17–23 July 2022; pp. 12888–12900. [Google Scholar] [CrossRef]
  44. Yu, J.; Wang, Z.; Vasudevan, V.; Yeung, L.; Seyedhosseini, M.; Wu, Y. CoCa: Contrastive Captioners are Image-Text Foundation Models. arXiv 2022, arXiv:2205.01917. [Google Scholar] [CrossRef]
  45. Li, J.; Li, D.; Savarese, S.; Hoi, S. BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models. arXiv 2023, arXiv:2301.12597. [Google Scholar] [CrossRef]
  46. Gui, L.; Wang, B.; Huang, Q.; Hauptmann, A.; Bisk, Y.; Gao, J. KAT: A knowledge augmented transformer for vision-and-language. arXiv 2021, arXiv:2112.08614. [Google Scholar] [CrossRef]
  47. Johnson, J.; Douze, M.; Jégou, H. Billion-scale similarity search with GPUs. IEEE Trans. Big Data 2019, 7, 535–547. [Google Scholar] [CrossRef]
  48. Jocher, G.; Stoken, A.; Borovec, J.; NanoCode012; Stan, C.; Liu, C.; Hogan, A.; Diaconu, L.; Ingham, D.; Gupta, N.; et al. ultralytics/yolov5: v3.1—Bug Fixes and Performance Improvements 2020. Available online: https://zenodo.org/record/4154370#.ZFo7os5BxPY (accessed on 25 March 2023).
  49. Papineni, K.; Roukos, S.; Ward, T.; Zhu, W.J. BLEU: A method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, Philadelphia, PA, USA, 7–12 July 2002; pp. 311–318. [Google Scholar] [CrossRef]
  50. Banerjee, S.; Lavie, A. METEOR: An automatic metric for MT evaluation with improved correlation with human judgments. In Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, Ann Arbor, MI, USA, 29 June 2005; pp. 65–72. [Google Scholar]
  51. Lin, C.Y.; Hovy, E. Automatic evaluation of summaries using n-gram co-occurrence statistics. In Proceedings of the Human Language Technology Conference of the North American Chapter of the Association for Computational linguistics, Edmonton, AB, Canada, 27 May–1 June 2003; pp. 150–157. [Google Scholar]
  52. Anderson, P.; Fernando, B.; Johnson, M.; Gould, S. SPICE: Semantic propositional image caption evaluation. In Proceedings of the European Conference on Computer Vision, Amsterdam, The Netherlands, 11–14 October 2016; pp. 382–398. [Google Scholar] [CrossRef]
  53. Vedantam, R.; Lawrence Zitnick, C.; Parikh, D. CIDEr: Consensus-based image description evaluation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 4566–4575. [Google Scholar]
  54. Chen, X.; Fang, H.; Lin, T.Y.; Vedantam, R.; Gupta, S.; Dollár, P.; Zitnick, C.L. Microsoft COCO Captions: Data collection and evaluation server. arXiv 2015, arXiv:1504.00325. [Google Scholar] [CrossRef]
  55. Nijkamp, E.; Ruffolo, J.; Weinstein, E.N.; Naik, N.; Madani, A. Progen2: Exploring the boundaries of protein language models. arXiv 2022, arXiv:2206.13517. [Google Scholar]
  56. Jozefowicz, R.; Vinyals, O.; Schuster, M.; Shazeer, N.; Wu, Y. Exploring the limits of language modeling. arXiv 2016, arXiv:1602.02410. [Google Scholar]
  57. Fernández-González, D.; Gómez-Rodríguez, C. Dependency parsing with bottom-up Hierarchical Pointer Networks. Inf. Fusion 2023, 91, 494–503. [Google Scholar] [CrossRef]
  58. Chen, S.; Zhao, Q. Rex: Reasoning-aware and grounded explanation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 19–20 June 2022; pp. 15586–15595. [Google Scholar]
  59. OpenAI. GPT: Language Models. 2021. Available online: https://openai.com/language/models/gpt-3/ (accessed on 25 March 2023).
  60. Li, H.; Srinivasan, D.; Zhuo, C.; Cui, Z.; Gur, R.E.; Gur, R.C.; Oathes, D.J.; Davatzikos, C.; Satterthwaite, T.D.; Fan, Y. Computing personalized brain functional networks from fMRI using self-supervised deep learning. Med. Image Anal. 2023, 85, 102756. [Google Scholar] [CrossRef]
  61. Rudovic, O.; Lee, J.; Dai, M.; Schuller, B.; Picard, R.W. Personalized machine learning for robot perception of affect and engagement in autism therapy. Sci. Robot. 2018, 3, eaao6760. [Google Scholar] [CrossRef] [PubMed]
Figure 2. An overview of the proposed method. The proposed method is designed to answer the input question $Q_n$ and to generate a human-friendly natural language explanation $W_n$ based on the input image $I_n$. To provide additional information, we incorporate the multiple-image caption set $C_n$, the outside knowledge set $K_n$, and object information into the model. These sources provide an array of helpful reference content during the generation process. The specific meanings of the symbols are listed in Table 1.
Figure 3. A sample of the dataset used in our experiments. Each image has a question and its explanation. “A1–A10” denotes the n-th answer, and “E1–E3” denotes the n-th explanation.
Figure 4. Word cloud of all questions in our dataset. The font size represents the frequency of occurrence in the dataset. The most frequent occurrences are “sport”, “animal”, and “room”.
Figure 5. A qualitative comparison between the proposed method and the state-of-the-art CM4 method. The results demonstrate that our model can generate more reasonable explanations by leveraging helpful information from multiple references. Each image in the samples is accompanied by the references generated using our model. The correct answers are indicated in parentheses, and we used the purple section to mark the valuable information for the generation process. “OK” indicates the outside knowledge in this figure.
Table 1. A cross-reference table of the symbols used in the experiments with their meanings.

Symbol | Model
$E_L$ | Language encoder
$E_V$ | Vision encoder
$G_C$ | Image captioning model
$G_O$ | Object recognition model
$R$ | Outside knowledge retrieval model
$g_L$ | MLP for language features
$g_I$ | MLP for image feature
$D$ | Decoder
Table 2. The comparative methods (CMs) in our experiments for the quantitative evaluation, and the proposed methods (PMs) with different settings for the ablation study.

Method | Overview
CM1 [17] | The questions are answered directly without reference to the image information, and the corresponding explanations are generated. The model uses a separate question-answering model and an explanation generation model.
CM2 [17] | The initial study primarily concerned generating natural language justifications for complex visual reasoning tasks, such as VQA, visual–textual entailment, and commonsense reasoning. It has separate question-answering models and explanation generation models.
CM3 [18] | The method serves as a benchmark for evaluating explainable vision–language tasks. It introduces a unified evaluation framework and comprehensively compares existing approaches that generate natural language explanations for vision–language tasks.
CM4 [19] | A language model that is both compact and general in nature and is faithful to the underlying data. This model can predict answers and provide explanations simultaneously. The model introduces large-scale vision and language models to generate explanations and achieves good performance.
PM1 [26] | The proposed method only introduces multiple image captions as an additional reference.
PM2 | The proposed method only introduces outside knowledge as an additional reference.
PM3 | The proposed method only introduces object information as an additional reference.
PM4 | The proposed method uses four different MLPs with small hidden space sizes to handle different types of reference information.
PM5 | The proposed method uses concatenation rather than summation in extracting caption and outside knowledge features from the set.
Table 3. Quantitative comparisons of our model with other state-of-the-art models using explanation metrics that evaluate word-level accuracy on the VQA-X dataset. BLEU1 through SPICE evaluate the generated explanations; Acc. refers to the accuracy of the model in answering the question, with a higher score indicating better performance.

Method | BLEU1 | BLEU2 | BLEU3 | BLEU4 | ROUGE-L | METEOR | CIDEr | SPICE | Acc.
CM1 | 59.3 | 44.0 | 32.3 | 23.9 | 47.4 | 20.6 | 91.8 | 17.9 | 63.4
CM2 | 59.4 | 43.4 | 31.1 | 22.3 | 46.6 | 20.1 | 84.4 | 17.3 | 62.7
CM3 | 52.6 | 36.6 | 24.9 | 17.2 | 40.3 | 19.0 | 51.8 | 15.7 | 56.8
CM4 | 64.0 | 48.6 | 36.4 | 27.2 | 50.7 | 22.6 | 104.7 | 21.6 | 67.1
PM | 65.0 | 50.0 | 38.0 | 28.8 | 51.4 | 23.3 | 110.0 | 22.0 | 69.5
Table 4. Quantitative comparisons of our model with other state-of-the-art models on the e-SNLI-VE dataset. BLEU1 through SPICE evaluate the generated explanations; Acc. evaluates the generated answers.

Method | BLEU1 | BLEU2 | BLEU3 | BLEU4 | ROUGE-L | METEOR | CIDEr | SPICE | Acc.
CM3 | 30.1 | 19.9 | 13.7 | 9.6 | 27.8 | 19.6 | 85.9 | 34.5 | 79.5
CM4 | 35.7 | 24.0 | 16.8 | 11.9 | 33.4 | 18.1 | 114.7 | 32.1 | -
PM | 36.4 | 24.6 | 17.1 | 12.2 | 33.9 | 18.4 | 117.2 | 32.5 | 74.2
Table 5. Ablation study for feature fusion on the VQA-X dataset. We used four small MLPs for different types of reference information separately and used concatenation when obtaining caption and outside knowledge features in Equations (2) and (4). BLEU1 through SPICE evaluate the generated explanations; Acc. evaluates the generated answers.

Method | BLEU1 | BLEU2 | BLEU3 | BLEU4 | ROUGE-L | METEOR | CIDEr | SPICE | Acc.
PM1 | 64.0 | 49.1 | 37.2 | 28.3 | 51.3 | 23.1 | 109.8 | 21.3 | 68.5
PM2 | 63.4 | 48.6 | 36.8 | 28.2 | 51.1 | 22.7 | 107.2 | 21.1 | 66.4
PM3 | 63.4 | 48.6 | 36.6 | 27.4 | 51.1 | 22.5 | 106.2 | 21.1 | 66.5
PM4 | 64.4 | 49.0 | 36.6 | 27.2 | 50.9 | 22.6 | 104.0 | 20.9 | 66.5
PM5 | 64.4 | 49.1 | 37.0 | 27.8 | 51.2 | 23.0 | 105.3 | 21.7 | 67.0
PM | 65.0 | 50.0 | 38.0 | 28.8 | 51.4 | 23.3 | 110.0 | 22.0 | 69.5
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Zhu, H.; Togo, R.; Ogawa, T.; Haseyama, M. Multimodal Natural Language Explanation Generation for Visual Question Answering Based on Multiple Reference Data. Electronics 2023, 12, 2183. https://doi.org/10.3390/electronics12102183


